
Optimizing the maximum reported cluster size in the spatial scan statistic for survival data.

Sujee Lee, Jisu Moon, Inkyung Jung.

Abstract

BACKGROUND: The spatial scan statistic is a useful tool for cluster detection analysis in geographical disease surveillance. The method requires users to specify the maximum scanning window size or the maximum reported cluster size (MRCS), which is often set to 50% of the total population. It is important to optimize the MRCS, while keeping the maximum scanning window size as large as 50% of the total population, to obtain valid and meaningful results.
RESULTS: We developed a measure, a Gini coefficient, to optimize the maximum reported cluster size for the exponential-based spatial scan statistic. The simulation study showed that the proposed method mostly selected the optimal MRCS, similar to the true cluster size. The detection accuracy was higher for the best chosen MRCS than at the default setting. The application of the method to the Korea Community Health Survey data supported that the proposed method can optimize the MRCS in spatial cluster detection analysis for survival data.
CONCLUSIONS: Using the Gini coefficient in the exponential-based spatial scan statistic can be very helpful for reporting more refined and informative clusters for survival data.


Keywords:  Exponential model; Gini coefficient; SaTScan; Spatial cluster detection

Year:  2021        PMID: 34238302      PMCID: PMC8265152          DOI: 10.1186/s12942-021-00286-w

Source DB:  PubMed          Journal:  Int J Health Geogr        ISSN: 1476-072X            Impact factor:   3.918


Background

The spatial scan statistic is a useful and widely used tool for detecting spatial or space–time clusters in disease surveillance. The method has been developed for different types of data such as count [1], ordinal [2, 3], survival [4], continuous [5-7], and multinomial [8]. The freely available software SaTScan™ [9] makes the method easily accessible to researchers. The spatial scan statistic is formulated based on the likelihood ratio test statistic. A large number of scanning windows of various sizes, centered at locations across the entire study area, are first constructed. Each scanning window is a candidate for the most likely cluster. In SaTScan™, circular or elliptical scanning windows are considered. The likelihood ratio test statistic is calculated for each window to compare its inside and outside. The scanning window with the maximum value of the likelihood ratio test statistic is defined as the most likely cluster. Secondary clusters with high test statistic values are also reported. Cluster detection results can be sensitive to the maximum scanning window size (MSWS), as studied by Ribeiro and Costa [10]. In SaTScan™, users can specify the MSWS, which is set to 50% of the total population by default. A high MSWS and a high maximum reported cluster size (MRCS) could result in an excessively large cluster. Some researchers try different MSWS values to obtain seemingly good results without knowing the MRCS. Repeatedly performing spatial cluster detection analyses using different values of MSWS leads to a multiple testing problem, as pointed out by Han et al. [11]. We can consider different values of MRCS with a fixed MSWS to avoid this problem. Still, we need to choose the optimal value of the MRCS. The clusters reported with a subjectively chosen MRCS may differ from the true clusters. Han et al. [11] proposed a criterion measure to optimize the MRCS for the Poisson-based spatial scan statistic.
They defined the Gini coefficient to represent the degree of heterogeneity of disease clusters for count data. Their simulation study showed that the Gini coefficient can be useful for selecting the best MRCS to obtain a refined collection of clusters. Interestingly, by reporting an optimized and refined collection of clusters rather than a single large cluster, the Gini coefficient allows us to better identify irregularly shaped ones [12]. As the formulation of test statistics of the spatial scan statistic is different for different models, the Gini coefficient should be clearly and distinctly defined for each model and thoroughly evaluated. The Gini coefficients for the ordinal- and normal-based spatial scan statistics were proposed by Kim and Jung [13] and by Yoo and Jung [14], respectively. In this paper, we defined the Gini coefficient for the exponential-based spatial scan statistic, which is used for survival data. Through an extensive simulation study under various scenarios, we showed that the proposed method is very useful for optimizing the MRCS for the exponential-based spatial scan statistic. We illustrated the method using Community Health Survey data collected by the Korea Centers for Disease Control and Prevention.

Methods

Poisson model and the Gini coefficient

When we have count data such as the number of occurrences of a certain disease relative to an underlying population at risk in a study region, we can use the Poisson-based spatial scan statistic [1]. We are often interested in identifying areas with high disease incidence rates. The null and alternative hypotheses are written as

$$H_0: p = q \ \text{for all } z \in Z \quad \text{vs.} \quad H_a: p > q \ \text{for some } z \in Z,$$

where $p$ and $q$ are the intensities of the outcome variable inside and outside the scanning window $z$, respectively, and $Z$ denotes the collection of all scanning windows. The likelihood ratio test statistic given window $z$ is expressed as

$$\lambda_z = \frac{\left(\dfrac{c_z}{n_z}\right)^{c_z} \left(\dfrac{C - c_z}{N - n_z}\right)^{C - c_z}}{\left(\dfrac{C}{N}\right)^{C}}$$

if $c_z/n_z > (C - c_z)/(N - n_z)$, and $\lambda_z = 1$ otherwise. In the above equation, $c_z$ and $n_z$ denote the observed number of cases and the population within window $z$, and $C$ and $N$ denote the total number of cases and the total population in the whole study area, respectively. The scanning window that maximizes the value of $\lambda_z$ is the most likely cluster. Statistical inference for the most likely cluster can be performed using Monte Carlo hypothesis testing. In addition, secondary clusters with high values of the likelihood ratio test statistic are often of interest. The p-values of the secondary clusters are typically obtained in the same manner, so that each secondary cluster is judged on whether it can reject the null hypothesis on its own strength. When reporting the most likely and secondary clusters, the Gini coefficient can be used to find a more refined collection of non-overlapping clusters.

In economics, the Gini coefficient was developed to indicate the degree of heterogeneity of wealth distribution [15]. As a summary measure of the Lorenz curve, the larger the Gini coefficient, the higher the heterogeneity in wealth. Han et al. [11] adopted the Gini coefficient in the spatial scan statistic for count data to measure the degree of heterogeneity in the spatial distribution of disease cases by defining the x-axis of the Lorenz curve as the cumulative proportion of the number of disease cases and the y-axis as the cumulative proportion of the population. Its value is calculated as twice the area between the Lorenz curve and the 45° line, on which the number of cases is proportional to the population of each region. When there is only one significant cluster, the Lorenz curve is constructed as a line graph connecting the three points $(0,0)$, $(x_1, y_1)$, and $(1,1)$, where $x_1$ and $y_1$ are the proportions of observed cases and population (expected cases) in the cluster. As more cases are concentrated in the cluster than expected, $x_1$ increases and the Lorenz curve moves farther away from the reference line. The Gini coefficient also increases. When we have $K$ multiple clusters, the Lorenz curve connects $K$ points between $(0,0)$ and $(1,1)$. The coordinates of each cluster are defined as

$$x_k = \frac{\sum_{i=1}^{k} c_i}{C} \quad \text{and} \quad y_k = \frac{\sum_{i=1}^{k} n_i}{N}, \quad k = 1, \ldots, K,$$

where $c_i$ and $n_i$ are the number of cases and the population in the $i$-th cluster. The Gini coefficient can be calculated as

$$G = 1 - \sum_{k=1}^{K+1} (x_k - x_{k-1})(y_k + y_{k-1})$$

with $(x_0, y_0) = (0,0)$ and $(x_{K+1}, y_{K+1}) = (1,1)$. The Gini coefficient values range from 0 to 1. Among several competing sets of clusters, we report the collection with the highest Gini coefficient value. Han et al. [11] provide more detailed information. The Gini coefficient has been implemented in SaTScan™ for the Poisson and Bernoulli models.
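As a rough illustration of the calculation described above, the following Python sketch computes the Gini coefficient from a list of reported clusters using the trapezoid form of "one minus twice the area under the Lorenz curve". The function name and argument layout are our own assumptions for illustration and do not reflect the SaTScan™ implementation.

```python
from typing import Sequence


def gini_from_clusters(cases: Sequence[float], pops: Sequence[float],
                       total_cases: float, total_pop: float) -> float:
    """Gini coefficient for a set of reported clusters (illustrative sketch).

    cases[k], pops[k]: cases and population in the k-th reported cluster.
    x-axis: cumulative proportion of cases; y-axis: cumulative proportion
    of population, as described in the text.
    """
    x_prev, y_prev = 0.0, 0.0   # Lorenz curve starts at (0, 0)
    cum_cases, cum_pop = 0.0, 0.0
    trapezoids = 0.0
    for c, n in zip(cases, pops):
        cum_cases += c
        cum_pop += n
        x, y = cum_cases / total_cases, cum_pop / total_pop
        trapezoids += (x - x_prev) * (y + y_prev)
        x_prev, y_prev = x, y
    # Closing segment from the last cluster point to (1, 1)
    trapezoids += (1.0 - x_prev) * (1.0 + y_prev)
    return 1.0 - trapezoids


# One cluster holding 30% of the cases but only 10% of the population
print(gini_from_clusters([30], [10], total_cases=100, total_pop=100))
```

With a single cluster the expression reduces to the difference between the two proportions (here 0.3 minus 0.1), which matches the geometric description of twice the area between the Lorenz curve and the reference line.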

Spatial scan statistic for survival data

Different spatial scan statistics for survival data have been proposed based on different models, including Weibull and generalized life distributions [16, 17]. Huang et al. [4] proposed a spatial scan statistic for survival data based on an exponential model. We focused on the exponential model. The exponential-based spatial scan statistic has been used to examine geographic disparities in survival in cancer patients [18-20].

Suppose we have survival data for $I$ subjects in a study area, such as time to death for cancer patients. Let $T_i$ and $C_i$ be the survival time and fixed censoring time for the $i$th subject, respectively. We assume that $T_i$ is exponentially distributed with a probability density function

$$f(t) = \frac{1}{\theta} \exp\left(-\frac{t}{\theta}\right), \quad t \ge 0.$$

Parameter $\theta$ represents the mean survival time. The observed time is $t_i = \min(T_i, C_i)$. Let $\delta_i$ be the censoring indicator, that is, $\delta_i = 1$ if $T_i \le C_i$ and $\delta_i = 0$ if $T_i > C_i$. To identify clusters of short survival, the null and alternative hypotheses are written as

$$H_0: \theta_{in} = \theta_{out} \ \text{for all } z \in Z \quad \text{vs.} \quad H_a: \theta_{in} < \theta_{out} \ \text{for some } z \in Z,$$

where $\theta_{in}$ denotes the mean survival time for subjects within zone $z$, and $\theta_{out}$ is the mean survival time for subjects outside zone $z$. The exponential-based spatial scan statistic is defined as

$$\lambda = \max_{z \in Z} \frac{\left(\dfrac{r_{in}}{\sum_{i \in z} t_i}\right)^{r_{in}} \left(\dfrac{r_{out}}{\sum_{i \notin z} t_i}\right)^{r_{out}}}{\left(\dfrac{R}{\sum_{i} t_i}\right)^{R}} \, I\left(\frac{\sum_{i \in z} t_i}{r_{in}} < \frac{\sum_{i \notin z} t_i}{r_{out}}\right),$$

where $r_{in} = \sum_{i \in z} \delta_i$ and $r_{out} = \sum_{i \notin z} \delta_i$ (the number of non-censored subjects inside and outside zone $z$, respectively). The total number of non-censored subjects in the entire study area is denoted by $R = r_{in} + r_{out}$. When there are no censored observations, $r_{in}$ and $r_{out}$ are replaced by the total number of subjects inside and outside zone $z$, $n_{in}$ and $n_{out}$, respectively, and $R$ by $I$ in the above test statistic.

When searching for clusters of short survival time using SaTScan™, users can specify the maximum size for $z$. The default setting is 50% of the total population. When the size of the most likely cluster is very large, one may want to know whether it contains smaller clusters that are statistically significant. In that case, we can try different values for the maximum reported cluster size (MRCS) rather than for the maximum scanning window size (MSWS). The MRCS is also set to 50% of the total population by default. It is not clear how to select the best MRCS for the exponential model. In the next section, we propose a Gini coefficient to optimize the MRCS for the exponential model.
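To make the test statistic for a single candidate zone more concrete, the sketch below evaluates the logarithm of the exponential likelihood ratio for clusters of short survival. It is a minimal illustration under our own assumptions: the function name, the boolean zone-membership list, and the handling of zones without any non-censored subject (treated as contributing nothing) are not taken from SaTScan™ or from Huang et al. [4].

```python
import math
from typing import Sequence


def exponential_llr(times: Sequence[float], events: Sequence[int],
                    inside: Sequence[bool]) -> float:
    """Log likelihood ratio for one zone under the exponential model.

    times[i]:  observed time of subject i
    events[i]: 1 if non-censored, 0 if censored
    inside[i]: True if subject i lies in the candidate zone
    Returns 0.0 when the zone does not indicate shorter survival.
    """
    r_in = sum(d for d, z in zip(events, inside) if z)
    r_out = sum(d for d, z in zip(events, inside) if not z)
    t_in = sum(t for t, z in zip(times, inside) if z)
    t_out = sum(t for t, z in zip(times, inside) if not z)
    # Assumption: zones with no events or no follow-up time on either
    # side are skipped rather than evaluated.
    if r_in == 0 or r_out == 0 or t_in <= 0 or t_out <= 0:
        return 0.0
    # Restrict to short-survival clusters: estimated mean inside < outside
    if t_in / r_in >= t_out / r_out:
        return 0.0
    r_tot, t_tot = r_in + r_out, t_in + t_out
    return (r_in * math.log(r_in / t_in)
            + r_out * math.log(r_out / t_out)
            - r_tot * math.log(r_tot / t_tot))


# Two subjects with short times inside the zone, two with long times outside
print(exponential_llr([1, 1, 10, 10], [1, 1, 1, 1], [True, True, False, False]))
```

The scan statistic itself would be the maximum of this quantity over all candidate zones, with significance assessed by Monte Carlo replication, which is not shown here.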

Gini coefficient for exponential model

To measure the disproportion of survival in each area, the Lorenz curve can be defined using the number of subjects and the sum of survival times. We define the x-axis as the cumulative proportion of the number of non-censored subjects and the y-axis as the cumulative proportion of the sum of observed times. If there is only one significant cluster, the Lorenz curve is constructed in the same way as that of the Poisson model. Specifically, the x- and y-coordinates of point P for the cluster $z_1$ are defined as

$$x_1 = \frac{r_1}{R} \quad \text{and} \quad y_1 = \frac{\sum_{i \in z_1} t_i}{\sum_{i} t_i},$$

where $r_1$ is the number of non-censored subjects in the cluster, $R$ is the total number of non-censored subjects, and $t_i$ is the observed time of the $i$th subject. Considering the maximum likelihood estimates for the parameter of the exponential distribution under the null and alternative hypotheses, that is, $\hat{\theta} = \sum_{i} t_i / R$ and $\hat{\theta}_{in} = \sum_{i \in z} t_i / r_{in}$ (with $\hat{\theta}_{out}$ defined analogously), the cumulative proportion of the sum of the observed times would be proportional to the cumulative proportion of non-censored subjects in each region under the null hypothesis of no clusters. If there is a significant cluster of short survival, the proportion of the sum of observed times in the cluster to that in the whole study region would be less than the proportion of the number of subjects. As the sum of the observed times in the cluster decreases, the y-coordinate decreases and the Lorenz curve moves farther away from the reference line. Then, the value of the Gini coefficient, which is twice the area between the Lorenz curve and the reference line, increases. When there are $K$ clusters $z_1, \ldots, z_K$ (ordered by their statistical significance), the coordinates of each cluster are defined as

$$x_k = \frac{\sum_{j=1}^{k} r_j}{R} \quad \text{and} \quad y_k = \frac{\sum_{j=1}^{k} \sum_{i \in z_j} t_i}{\sum_{i} t_i}, \quad k = 1, \ldots, K.$$

The Lorenz curve connects the $K$ points $(x_k, y_k)$, and the Gini coefficient is calculated in the same way as for the Poisson model,

$$G = 1 - \sum_{k=1}^{K+1} (x_k - x_{k-1})(y_k + y_{k-1})$$

with $(x_0, y_0) = (0,0)$ and $(x_{K+1}, y_{K+1}) = (1,1)$. Different values for the MRCS produce different sets of clusters with different values of the Gini coefficient. We can select the optimal collection of clusters with the highest dissimilarity in survival based on the Gini coefficient.
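The construction above can be sketched in a few lines of Python. The example below is only illustrative: the function name and the representation of each cluster as a list of subject indices are assumptions made for this sketch, and the Gini value is obtained with the trapezoid form of "one minus twice the area under the Lorenz curve".

```python
from typing import Sequence


def exponential_gini(clusters: Sequence[Sequence[int]],
                     times: Sequence[float],
                     events: Sequence[int]) -> float:
    """Gini coefficient for survival data (illustrative sketch).

    clusters: reported clusters, ordered by statistical significance,
              each given as the indices of the subjects it contains
    times:    observed times of all subjects
    events:   1 if non-censored, 0 if censored
    """
    total_events = sum(events)
    total_time = sum(times)
    x_prev, y_prev = 0.0, 0.0
    cum_events, cum_time = 0.0, 0.0
    trapezoids = 0.0
    for members in clusters:
        cum_events += sum(events[i] for i in members)
        cum_time += sum(times[i] for i in members)
        x = cum_events / total_events   # share of non-censored subjects
        y = cum_time / total_time       # share of total observed time
        trapezoids += (x - x_prev) * (y + y_prev)
        x_prev, y_prev = x, y
    trapezoids += (1.0 - x_prev) * (1.0 + y_prev)  # segment to (1, 1)
    return 1.0 - trapezoids


# One cluster with half of the events but only 10% of the observed time
print(exponential_gini([[0, 1]], times=[1, 1, 9, 9], events=[1, 1, 1, 1]))
```

For a single cluster the value equals the share of non-censored subjects minus the share of observed time, so a cluster with shorter survival than the rest of the region yields a larger coefficient.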

Simulation study

We conducted a simulation study to evaluate the performance of the Gini coefficient in the exponential model. We used six cluster models in Seoul and Gyeonggi Province in South Korea as the whole study region. True clusters of different shapes and sizes are assumed in the study region, consisting of 67 districts, as shown in Fig. 1. Since circular and elliptical windows are available in SaTScan™, we mainly considered these two shapes. We also included an irregularly shaped cluster to examine whether the proposed method could possibly work better in identifying irregular clusters than the default setting. Cluster models A and B assumed a circular true cluster of 10% (6 districts) and 30% (20 districts) of the entire study region, respectively. Cluster model C included two adjacent circular clusters, each of which accounts for 10% (6 districts). Models D and E consisted of elliptical clusters of 10% (6 districts) and 30% (20 districts). Model F included an irregularly shaped cluster of 20% (13 districts). For each model, we considered 12 scenarios for the combination of mean survival time and censoring rate. We varied the mean survival time for the true clusters as 2, 5, and 7, compared to 10 for areas outside the clusters. We adopted the parameter setting for the mean survival time from the study by Huang et al. [4]. The censoring rates were set to 10%, 30%, 50%, and 70% to examine how the performance of the proposed method can be affected by the censoring rate.
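One simulated dataset for a given scenario could be generated along the following lines, with 1,000 subjects assigned at random to the 67 districts as in the simulation described in this section. This is a hedged sketch rather than the authors' code: the function name and arguments are hypothetical, and the way the observed time of a censored subject is set (here, the generated time is kept and only the indicator is changed) is our assumption, since the text does not specify it.

```python
import random
from typing import List, Optional, Set, Tuple


def simulate_dataset(true_cluster: Set[int], n_subjects: int = 1000,
                     n_districts: int = 67, mean_in: float = 5.0,
                     mean_out: float = 10.0, censor_rate: float = 0.3,
                     seed: Optional[int] = None) -> List[Tuple[int, float, int]]:
    """Return (district, observed_time, event) triples for one dataset."""
    rng = random.Random(seed)
    # Randomly choose which subjects are censored at the given rate
    n_censored = round(censor_rate * n_subjects)
    censored = set(rng.sample(range(n_subjects), n_censored))
    data = []
    for i in range(n_subjects):
        district = rng.randrange(n_districts)
        mean = mean_in if district in true_cluster else mean_out
        time = rng.expovariate(1.0 / mean)  # exponential with the given mean
        # Assumption: a censored subject keeps the generated time as the
        # censoring time; only the event indicator changes.
        event = 0 if i in censored else 1
        data.append((district, time, event))
    return data


# Six hypothetical districts forming the true cluster, 30% censoring
sample = simulate_dataset({0, 1, 2, 3, 4, 5}, seed=1)
print(len(sample), sum(1 - e for _, _, e in sample))
```

The district identifiers used for the true cluster are placeholders; in the study they correspond to the districts shown in Fig. 1.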
Fig. 1

Cluster models used in the simulation. A one circular cluster of 10%, B one circular cluster of 30%, C two circular clusters of 10% each, D one elliptical cluster of 10%, E one elliptical cluster of 30%, F one irregular cluster of 20%

We generated 1,000 subjects and randomly assigned each of them to one of the 67 districts in the study region under each scenario. If a subject was in a district of the true cluster, the survival time was generated from an exponential distribution with a mean of 2, 5, or 7, depending on the scenario. Otherwise, the survival time was generated from an exponential distribution with a mean of 10. We censored the survival times of randomly selected subjects among the 1,000 subjects at the chosen censoring rate. We then searched for clusters with short survival using circular and elliptical scanning windows, with 15 MRCS values of 3%, 4%, 5%, 6%, 8%, 10%, 12%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50% in the SaTScan™ software. Using these values can be thought of as a grid search. We chose these candidate values because they are the ones used for the Poisson and Bernoulli models in SaTScan™, which keeps the exponential model consistent with those models. The MSWS was fixed at 50%. The Gini coefficient was calculated for each MRCS value. We selected the optimal MRCS with the highest Gini coefficient. The reported clusters were then compared with the true clusters. We repeated the simulation 1,000 times for each scenario. We counted the number of times the Gini coefficient selected each of the 15 MRCS values as the optimal. The performance of the proposed method was summarized using the sensitivity and positive predictive value (PPV). In the context of spatial cluster detection, sensitivity is the proportion of districts correctly detected among the districts in the true cluster, and PPV is the proportion of districts correctly detected among the districts in the detected cluster. Higher values of these measures indicate more accurate detection. Specifically, the sensitivity and PPV were estimated from 1,000 datasets as

$$\text{Sensitivity} = \frac{1}{N_{\text{rej}}} \sum_{m=1}^{N_{\text{rej}}} \frac{|D_m \cap T|}{|T|} \quad \text{and} \quad \text{PPV} = \frac{1}{N_{\text{rej}}} \sum_{m=1}^{N_{\text{rej}}} \frac{|D_m \cap T|}{|D_m|},$$

where $T$ is the set of districts in the true cluster, $D_m$ is the set of districts in the detected cluster for the $m$th rejected dataset, and $N_{\text{rej}}$ is the number of rejected datasets. We also calculated the accuracy measures under the default MRCS setting of 50% in SaTScan™.
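The selection and evaluation steps described in this section can be illustrated with the short Python sketch below. It assumes that the Gini coefficient has already been computed for each candidate MRCS and that detected and true clusters are available as sets of district identifiers; the function names and the tie-breaking behaviour (the first maximum is returned) are our own choices for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def select_mrcs(gini_by_mrcs: Dict[float, float]) -> float:
    """Grid search: return the candidate MRCS with the highest Gini value."""
    return max(gini_by_mrcs, key=gini_by_mrcs.get)


def sensitivity_ppv(detected: Set[int], true_cluster: Set[int]) -> Tuple[float, float]:
    """District-level sensitivity and PPV for one rejected dataset."""
    hits = len(detected & true_cluster)
    sensitivity = hits / len(true_cluster)
    # Assumption: PPV is set to 0 when nothing is detected
    ppv = hits / len(detected) if detected else 0.0
    return sensitivity, ppv


def average_accuracy(results: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Average sensitivity and PPV over the rejected datasets."""
    pairs: List[Tuple[float, float]] = list(results)
    n = len(pairs)
    return (sum(s for s, _ in pairs) / n, sum(p for _, p in pairs) / n)


# Hypothetical Gini values for three of the candidate MRCS settings
print(select_mrcs({0.05: 0.10, 0.10: 0.25, 0.50: 0.18}))
# Detected cluster covers 4 of 6 true districts and nothing else
print(sensitivity_ppv({1, 2, 3, 4}, {1, 2, 3, 4, 5, 6}))
```

The numeric values in the usage lines are made up for demonstration and are not simulation results.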

Korea community health survey data

To illustrate the proposed method, we used data from the Korea Community Health Survey (KCHS) conducted by the Korea Centers for Disease Control and Prevention [21]. This community-based cross-sectional survey has been conducted at 253 community health centres annually since 2008. The survey questionnaire includes topics related to health behaviour and prevention. We used the age of first drinking for males as the survival time in the 2017 survey data. If a person had never had a drink, his survival time was censored at the age of the survey. The location information of each individual was available at the district level because each district in Korea has approximately one community health centre. In Seoul and Gyeonggi province, we searched for areas with low mean age of first drinking (i.e. spatial clusters of short survival time) using the exponential-based spatial scan statistic with both circular and elliptical scanning windows. The reported clusters selected optimally by the proposed method were compared with those at the default setting in SaTScan™.
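The coding of the outcome described above, with age of first drinking as the survival time and censoring at the age at survey for respondents who had never had a drink, might look as follows. The function and argument names are hypothetical and are not actual KCHS variable names.

```python
from typing import Optional, Tuple


def to_survival_record(age_at_survey: float,
                       age_first_drink: Optional[float]) -> Tuple[float, int]:
    """Return (observed_time, event) for one respondent.

    event = 1 if the age of first drinking was observed,
    event = 0 if the respondent never drank (censored at survey age).
    """
    if age_first_drink is None:
        return float(age_at_survey), 0
    return float(age_first_drink), 1


print(to_survival_record(45, 19))    # first drink observed at age 19
print(to_survival_record(30, None))  # never drank: censored at age 30
```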

Results

Simulation study results

Here, we present only a subset of the simulation results; the remaining results are included in Additional file 1. Tables 1 and 2 show that, when the true cluster was circular with a mean survival time of 5, the Gini coefficient most often selected an MRCS equal to the size of the true cluster with both circular and elliptical windows, regardless of the censoring rate. The detection accuracy was very high for the most frequently chosen MRCS. Both the sensitivity and PPV were above 0.95, which is higher than those at the default setting in most cases. The difference in the detection accuracy between the most often chosen MRCS and the default setting was larger when the true cluster was smaller (10%). The difference in PPV was even more pronounced. When the true cluster was medium sized (30%), the PPV was higher in every case at the most often chosen MRCS, while the sensitivity was slightly higher or similar. These results indicate that the spatial scan statistic without optimizing the MRCS tends to report a larger cluster than the true cluster, especially when the true cluster is small. A lower PPV implies that the detected cluster is larger because the number of districts in the detected cluster is in the denominator when calculating the PPV. We also summarized the overall detection accuracy when using the Gini coefficient over all the chosen MRCSs. Still, the sensitivity and PPV were higher than or similar to those at the default setting.
Table 1

Simulation results for cluster model A (one circular cluster, 10% of total area) with a mean survival time of 5

% of cens | Maximum reported cluster size (MRCS) | Default setting
3% | 4% | 5% | 6% | 8% | 10% | 12% | 15% | 20% | 25% | 30% | 35% | 40% | 45% | 50% | Overall
Circular window10%Frequency2515224865711076231584211989
Sensitivity0.0130.3330.6000.5760.6350.9720.9440.9830.9780.9780.9791.0001.0001.0001.0000.9340.928
PPV0.5001.0000.9310.9570.8850.9890.8140.6700.5030.4280.3220.2860.2350.2220.1880.9090.903
30%Frequency056322754016275773463311972
Sensitivity0.3670.3890.5630.5990.9570.9370.9600.9610.9951.0001.0001.0001.00010.9230.906
PPV0.9500.9110.9740.7830.9680.8120.7000.5240.4300.3390.2710.2340.2140.1880.8520.845
50%Frequency09012016367209159563287120986
Sensitivity0.333-0.5040.4900.9720.8960.9840.9460.9691.0001.0001.0000.917-0.8860.875
PPV1.000-0.9910.7160.9770.7740.7110.4790.4190.3270.3010.2310.165-0.8300.824
70%Frequency0205180531234583129631612918
Sensitivity0.3330.3330.500-0.9700.8280.6670.9790.8440.9941.0000.8890.9690.9860.9020.831
PPV1.00010.986-0.9970.7080.5000.5220.3490.3190.2690.2190.2080.1910.8410.837
Elliptical window10%Frequency1491774430241124462255412985
Sensitivity0.0100.4170.4820.5100.7010.9570.9740.9890.9891.0001.0000.9671.0001.0001.0000.9340.920
PPV1.00010.9631.0000.9620.9690.8230.6890.5220.3990.3260.2790.2380.2070.1850.8530.847
30%Frequency031281182482471798543123202971
Sensitivity0.3330.3330.5000.6750.9400.9420.9900.9730.9960.9861.0000.917-1.0000.9080.886
PPV11.0000.9880.9680.9490.7990.6900.5380.4110.3310.2870.220-0.1640.7940.788
50%Frequency263601221592582117754183022977
Sensitivity0.0270.3890.5560.5000.6690.9570.9130.9820.9570.9880.9821.000-1.0001.0000.8830.865
PPV1.0001.0000.9330.9920.9860.9550.7910.6960.5220.4110.3210.288-0.1820.1850.7810.779
70%Frequency04053184212169859130252020713913
Sensitivity0.333-0.5000.6880.9700.8710.7940.8390.8610.8531.0000.9420.9760.9870.8290.758
PPV1.000-1.0000.9870.9860.7700.5350.4600.3400.2750.2800.2270.2070.1820.7620.761

% of cens, percentage of censoring; PPV, positive predictive value

Cells most often selected as the optimal MRCS are shown in italics

Table 2

Simulation results for cluster model B (one circular cluster, 30% of total area) with a mean survival time of 5

% of cens | Maximum reported cluster size (MRCS) | Default setting
3% | 4% | 5% | 6% | 8% | 10% | 12% | 15% | 20% | 25% | 30% | 35% | 40% | 45% | 50% | Overall
Circular window10%Frequency0000011112279216616001000
Sensitivity0.8500.9000.8000.7000.7930.9850.9960.9940.9820.982
PPV0.8950.7831.0000.6671.0001.0000.9280.8200.9840.985
30%Frequency000000214616782522001000
Sensitivity0.7250.7000.5130.760.9750.9990.9250.9660.966
PPV1.0000.7781.0000.9860.9990.9350.7710.9810.982
50%Frequency000000304428001492001000
Sensitivity0.733-0.5630.8100.9761.0000.8750.9700.970
PPV1.000-1.0000.9980.9990.9240.7160.9870.987
70%Frequency000000001885111391801000
Sensitivity0.4500.7440.9090.9890.9780.9890.9180.918
PPV1.0001.0000.9970.9120.7780.6990.9800.980
Elliptical window10%Frequency0000001432475717631311000
Sensitivity0.7000.8750.7170.7960.9810.9880.9871.0001.0000.9770.977
PPV0.8750.8250.9550.9950.9970.9210.7960.7320.6060.9750.976
30%Frequency00000020139061926212111000
Sensitivity0.750-0.6270.7610.9700.9920.9831.0001.0000.9520.953
PPV0.917-1.0000.9910.9970.9260.7660.7410.6060.9740.973
50%Frequency0000001067475015610211000
Sensitivity0.750-0.6330.7640.9670.9900.9701.0001.0000.9540.953
PPV1.000-1.0000.9950.9950.9160.7650.7420.6250.9800.979
70%Frequency00000001315666156147841000
Sensitivity0.9000.5830.7530.9070.9560.9930.9691.0000.9250.925
PPV0.9001.0001.0000.9970.8760.8120.6860.6160.9470.946

% of cens, percentage of censoring; PPV, positive predictive value

Cells most often selected as the optimal MRCS are shown in italics

In the case of two true clusters that are close to each other, the proposed method often chose an MRCS slightly smaller than the size of the true cluster. However, the PPV was always higher than that at the default setting, although the sensitivity was slightly lower only when the mean survival time in the true clusters was 5. This result again implied that the default setting reported a cluster larger than the true clusters. When the mean survival time was 7 in the true clusters, the frequency of chosen MRCS was spread over all possible MRCSs (Table 3). This might be attributable to the low detection power due to the small difference in mean survival time inside vs. outside the clusters. The promising indication here is that the overall sensitivity is much higher when using the Gini coefficient than without it.
Table 3

Simulation results for cluster model C (two circular clusters, 10% each of total area) with a mean survival time of 7

% of cens | Maximum reported cluster size (MRCS) | Default setting
3% | 4% | 5% | 6% | 8% | 10% | 12% | 15% | 20% | 25% | 30% | 35% | 40% | 45% | 50% | Overall
Circular window10%Frequency1812111710348466519122039533926528
Sensitivity0.0050.1740.2880.2840.4140.4640.4780.5440.5130.5970.7710.8570.9430.9530.9710.5880.313
PPV0.9820.8750.9470.9310.9710.8190.7240.7110.5470.4840.4950.4680.4510.3930.3610.6940.690
30%Frequency307151511446558228341726182317527
Sensitivity0.0070.1670.2440.2560.3820.4090.4440.4980.5060.6740.7500.8400.9170.9710.9800.5000.267
PPV0.9671.0000.7780.9670.9860.7990.7360.7240.5250.5250.4850.4550.4390.3930.3600.7340.734
50%Frequency4741788224445826491123272817465
Sensitivity0.0110.1670.2450.2500.3830.4380.4510.4710.4580.6920.7350.8440.8740.9110.9270.5050.242
PPV1.0001.0000.8780.8330.9370.8240.7330.6870.4870.5350.4850.4530.4310.3810.3430.6970.692
70%Frequency77215328146514710232735325591
Sensitivity0.0160.1670.2500.2500.3560.3790.3830.3990.5000.6580.7570.9040.7780.8490.8270.4240.258
PPV0.9871.0000.7500.6000.7640.6970.5820.5060.4820.4920.4860.4770.3780.3570.3210.6910.691
Elliptical window10%Frequency41414177343543948555539392433551
Sensitivity0.0010.1960.2680.3040.3890.4460.4350.5320.5470.8110.8610.9490.9490.9510.9750.6360.350
PPV1.0000.9050.8690.9850.9080.8260.7090.6920.5740.6230.5730.5300.4600.3880.3540.6640.662
30%Frequency14236228486526443503420222515560
Sensitivity0.0040.2170.2360.2990.3670.4200.4310.4750.5740.7770.8510.9000.9050.9600.9830.5390.304
PPV1.0001.0000.9440.9670.9360.8710.7510.6660.5810.6090.5480.4910.4390.4000.3590.7290.729
50%Frequency27269205459397846502322202533531
Sensitivity0.0070.2440.2220.3000.3600.4120.4360.4730.5330.7170.7100.8450.9000.9370.9270.5260.285
PPV1.0001.0000.8610.9050.8870.8020.7400.6470.5420.5700.4400.4650.4370.3890.3390.6730.670
70%Frequency3423254154122171535143237402049648
Sensitivity0.0100.2210.2080.3230.3290.3360.3430.3830.5210.6370.7790.8870.8920.8830.9590.4810.316
PPV1.0001.0000.7080.9730.7620.5940.5290.4980.5360.4780.4920.4840.4320.3720.3570.6460.645

% of cens, percentage of censoring; PPV, positive predictive value

Cells most often selected as the optimal MRCS are shown in italics

In the case of elliptical clusters, the Gini coefficient with elliptical scanning windows most often selected an MRCS equal to the true cluster size when the mean survival time was 5 inside the true cluster (Tables 4 and 5). When the cluster was small (10%), the detection accuracy at the most often chosen MRCS was much higher than that at the default setting. When the mean survival time was 2 inside the true cluster, similar patterns were observed. The Gini coefficient with circular scanning windows most often selected an MRCS smaller than the true cluster size. Still, the overall sensitivity and PPV at the most often chosen MRCS were higher than those at the default setting. When the mean survival time was 7 inside the true cluster, the overall detection accuracy was higher than that at the default setting.
Table 4

Simulation results for cluster model D (one elliptical cluster, 10% of total area) with a mean survival time of 5

% of cens | Maximum reported cluster size (MRCS) | Default setting
3% | 4% | 5% | 6% | 8% | 10% | 12% | 15% | 20% | 25% | 30% | 35% | 40% | 45% | 50% | Overall
Circular window10%Frequency1438196256822041845736172065845
Sensitivity0.0170.2780.5000.6690.6500.6890.7000.7970.8630.9740.9860.9710.9921.0001.0000.7300.617
PPV1.0000.7780.8000.9700.8910.6970.5530.5090.4610.4100.3280.2700.2390.2000.1720.7280.726
30%Frequency63102403659273563441111452898
Sensitivity0.0110.3330.6170.6530.6550.7140.7140.8050.8330.9200.9851.0000.9580.9671.0000.6990.629
PPV0.8331.0000.8600.9890.9100.6780.5680.5010.4530.3850.3270.2700.2120.1920.1820.8070.802
50%Frequency3088741549737908491616719847
Sensitivity0.0030.6670.6340.6590.6600.6190.7660.7820.8650.9260.9790.9900.9290.9740.7160.612
PPV0.6670.7670.9660.8670.6940.5310.5030.4250.3530.2960.2700.2270.1960.1760.6980.691
70%Frequency002251429172532993257740496
Sensitivity-0.5000.6400.6120.6480.3330.7620.8070.8230.9780.9950.9670.9030.8830.8070.426
PPV-1.0000.9600.7680.7070.2860.4910.4440.3250.3290.2650.2280.1910.1710.4640.437
Elliptical window10%Frequency1514286742614010164341511842920
Sensitivity0.0040.3000.4880.6670.7040.9740.9730.9650.9610.9760.9780.9850.9581.0001.0000.9310.857
PPV1.0000.8000.9760.9660.9450.9530.8170.6570.5250.4040.3270.2780.2290.2050.1910.8200.820
30%Frequency229576846016096521783322941
Sensitivity0.0060.3330.5560.6700.6960.9590.9590.9530.9100.9510.9580.9440.9440.9171.0000.9120.858
PPV0.5001.0000.9440.9940.9730.9370.8190.6630.4930.4050.3050.2570.2260.1740.1880.8470.848
50%Frequency036354340248134543711612412924
Sensitivity0.2220.4720.7220.7350.9600.9620.9690.8800.8960.8640.9720.9580.9580.9720.9340.864
PPV0.6670.9440.8750.9440.9390.8250.6740.4730.3690.2790.2650.2300.1870.1820.7850.783
70%Frequency011213105301183821413930602034733
Sensitivity0.3330.4720.5000.7300.9760.7590.8860.7460.8500.8970.9560.9860.9420.9660.8930.657
PPV1.0000.9441.0000.9810.9240.6670.5940.4160.3490.2900.2650.2410.2040.1820.6920.690

% of cens, percentage of censoring; PPV, positive predictive value

Cells most often selected as the optimal MRCS are shown in italics

Table 5

Simulation results for cluster model E (one elliptical cluster, 30% of total area) with a mean survival time of 5

% of cens | Maximum reported cluster size (MRCS) | Default setting
3% | 4% | 5% | 6% | 8% | 10% | 12% | 15% | 20% | 25% | 30% | 35% | 40% | 45% | 50% | Overall
Circular window10%Frequency00003249212862442033424915291000
Sensitivity0.5000.6690.7090.7900.7640.7550.8300.8850.8960.8900.9620.8410.810
PPV0.9260.9970.9940.9780.8900.8490.7810.7760.7570.6390.5880.8260.820
30%Frequency0001818155209839120257144591000
Sensitivity0.6000.5560.5360.6940.7360.6920.6650.7600.8820.8960.8900.9330.7780.725
PPV0.9230.9831.0000.9920.9730.9180.8530.7650.7750.7680.6750.5770.8720.863
50%Frequency0000122691247194163951933121000
Sensitivity0.3000.6270.6120.6650.5630.6320.7470.8800.9000.9000.9130.7850.765
PPV1.0001.0000.9920.9700.9600.8540.8090.7780.7680.6450.5570.8370.836
70%Frequency3000468184645156803292511000
Sensitivity0.0670.2000.4250.5060.6970.5510.6720.7700.8780.8920.8720.9130.8400.836
PPV1.0001.0001.0000.9350.9410.9580.8150.7580.7650.7040.6240.5530.7580.758
Elliptical window10%Frequency000002115242724282731011811000
Sensitivity0.6250.6960.8670.7550.7650.8940.9510.9590.9921.0000.8990.896
PPV0.9620.9660.9300.9310.9930.9800.8910.7950.7090.6060.9280.933
30%Frequency00001122496313754313835921000
Sensitivity0.7000.5500.7000.8160.6520.7200.8930.9360.9540.9781.0000.8540.853
PPV1.0001.0000.9760.9270.9620.9800.9830.8810.7910.6800.6160.9540.956
50%Frequency00000314168121512170602131000
Sensitivity0.6170.7000.7620.6240.7070.8940.9310.9500.9831.0000.8580.858
PPV1.0000.9330.9420.9810.9660.9850.8560.7880.6980.6190.9400.943
70%Frequency000011310194937112626085751000
Sensitivity0.2000.6500.6330.6250.6290.7220.8730.9010.9520.9770.9950.8990.899
PPV1.0001.0000.9210.8820.9960.9210.9630.8010.7550.6930.6170.8370.838

% of cens, percentage of censoring; PPV, positive predictive value

Cells most often selected as the optimal MRCS are shown in italics

When the true cluster was irregularly shaped, the proposed method tended to choose MRCS values smaller than the true cluster size. However, the overall sensitivity was always higher than that at the default setting. When the mean survival time was 7 in the true cluster, the difference in performance was clearer (Table 6). This might be because refined sets of smaller clusters were reported by the Gini coefficient rather than a single larger cluster.
Table 6

Simulation results for cluster model F (one irregular cluster, 20% of total area) with a mean survival time of 7

% of cens | Maximum reported cluster size (MRCS) | Default setting
3% 4% 5% 6% 8% 10% 12% 15% 20% 25% 30% 35% 40% 45% 50% Overall
Circular window | 10% | Frequency | 20613276937573833451720373245496
Sensitivity | 0.005 0.141 0.266 0.279 0.360 0.383 0.418 0.490 0.543 0.656 0.697 0.792 0.963 0.966 0.978 0.563 0.285
PPV | 0.950 0.750 0.930 0.951 0.966 0.788 0.719 0.673 0.567 0.568 0.460 0.459 0.481 0.434 0.394 0.674 0.663
30% | Frequency | 20215125321703349551117431940460
Sensitivity | 0.004 0.154 0.226 0.282 0.373 0.374 0.409 0.501 0.584 0.641 0.769 0.810 0.964 0.931 0.983 0.575 0.267
PPV | 0.950 1.000 0.878 0.979 0.973 0.785 0.775 0.678 0.582 0.557 0.511 0.482 0.484 0.411 0.398 0.670 0.663
50% | Frequency | 41136273111135305039315591526519
Sensitivity | 0.008 0.154 0.220 0.271 0.360 0.385 0.404 0.474 0.500 0.602 0.667 0.821 0.979 0.903 0.950 0.498 0.263
PPV | 1.000 1.000 0.940 0.938 0.967 0.755 0.754 0.637 0.557 0.534 0.453 0.488 0.491 0.411 0.399 0.700 0.698
70% | Frequency | 66016420430106232437111031109662
Sensitivity | 0.015 0.226 0.308 0.331 0.382 0.411 0.452 0.583 0.605 0.615 0.769 0.923 0.849 0.924 0.473 0.317
PPV | 0.985 0.797 0.764 0.787 0.736 0.743 0.607 0.622 0.496 0.421 0.455 0.468 0.394 0.399 0.681 0.681
Elliptical window | 10% | Frequency | 46141631496071104803949412926619
Sensitivity | 0.001 0.154 0.247 0.313 0.347 0.424 0.512 0.574 0.673 0.779 0.862 0.906 0.934 0.942 0.965 0.668 0.414
PPV | 1.000 0.833 0.929 0.972 0.966 0.899 0.879 0.816 0.751 0.659 0.591 0.521 0.477 0.421 0.383 0.716 0.714
30% | Frequency | 95591434627279506360422433561
Sensitivity | 0.002 0.231 0.231 0.325 0.352 0.459 0.529 0.585 0.631 0.800 0.865 0.919 0.951 0.952 0.974 0.704 0.396
PPV | 0.944 1.000 0.933 1.000 0.954 0.951 0.932 0.835 0.716 0.666 0.594 0.538 0.498 0.431 0.392 0.705 0.703
50% | Frequency | 30121082226657493896950231824613
Sensitivity | 0.009 0.224 0.223 0.317 0.332 0.396 0.501 0.577 0.663 0.792 0.845 0.906 0.953 0.932 0.968 0.650 0.402
PPV | 1.000 1.000 0.933 0.888 0.899 0.883 0.904 0.813 0.758 0.667 0.578 0.541 0.490 0.426 0.395 0.726 0.727
70% | Frequency | 126515842102910024732341782918846
Sensitivity | 0.007 0.205 0.277 0.303 0.362 0.458 0.480 0.600 0.685 0.762 0.697 0.810 0.817 0.918 0.932 0.578 0.490
PPV | 1.000 1.000 1.000 0.911 0.926 0.856 0.846 0.869 0.810 0.654 0.495 0.484 0.434 0.422 0.395 0.797 0.797

% of cens, percentage of censoring; PPV, positive predictive value

Cells most often selected as the optimal MRCS are shown in italics


KCHS data analysis results

When circular windows were used, the proposed method selected the default setting of 50% as the optimal MRCS. The most likely cluster was quite large, comprising 31 districts, as shown in Fig. 2(a). A small secondary cluster of three districts was also detected. When elliptical windows were used, the proposed method selected 25% as the optimal MRCS, and the detected clusters were slightly different from those at the default setting. Information on the detected clusters is presented in Table 7. A single large cluster of 26 districts was detected at the default setting (Fig. 2(c)), whereas two smaller clusters were detected using the Gini coefficient (Fig. 2(b)). Cluster 1 in Fig. 2(b) is part of cluster 1 in Fig. 2(c). Some districts of cluster 2 in Fig. 2(b) overlapped with cluster 1 in Fig. 2(c), but the remaining districts were not included in it. The test statistic value for cluster 1 in Fig. 2(c) was much larger than that for cluster 1 in Fig. 2(b) (LLR 47.12 versus 26.73). However, the mean survival time of cluster 1 in Fig. 2(b) was lower than that of cluster 1 in Fig. 2(c) (21.34 versus 21.51). It is likely that the default setting detected a larger cluster by including unnecessary neighbouring districts. Although the mean survival time of cluster 2 in Fig. 2(b) (22.10) was higher than that of cluster 1 in Fig. 2(c), it was still lower than that outside the clusters, and the cluster was statistically significant (p = 0.001). The clusters at the optimal MRCS chosen by the Gini coefficient in Fig. 2(b) therefore appear to be more meaningful than cluster 1 in Fig. 2(c).
Fig. 2

Spatial clusters with low mean age of first drinking in Seoul and Gyeonggi province using 2017 KCHS data. a circular windows, Gini or default (50%), b elliptical windows, Gini (25%), c elliptical windows, default (50%)

Table 7

Cluster detection results for 2017 KCHS data using elliptical windows with the Gini coefficient and default setting for MRCS

Setting | Cluster | Districts | LLR | p-value | Mean survival time | Observations | Non-censored
Gini (25%) | 1 | 16 | 26.73 | 0.001 | 21.34 | 6584 | 6313
Gini (25%) | 2 | 15 | 9.88 | 0.001 | 22.10 | 7073 | 6706
Default | 1 | 26 | 47.12 | 0.001 | 21.51 | 11,271 | 10,781

Districts, number of districts; LLR, log-likelihood ratio; Observations, number of observations; Non-censored, number of non-censored observations
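The LLR and mean survival time columns can be related through the exponential likelihood. The sketch below assumes the usual exponential model with right censoring, in which the estimated mean survival time of a group is its total observed time divided by its number of non-censored observations. The function name `exponential_llr`, its arguments, and the example numbers are hypothetical, and this is not the SaTScan™ implementation.

```python
import math


def exponential_llr(r_in, t_in, r_total, t_total, low_only=True):
    """Log-likelihood ratio for one scanning window under an exponential model.

    r_in, r_total: non-censored observations inside the window / overall
    t_in, t_total: summed observed times inside the window / overall
    low_only: if True, only windows with a shorter mean survival time
              inside than outside receive a positive value.
    """
    r_out = r_total - r_in
    t_out = t_total - t_in
    if min(r_in, r_out) <= 0 or min(t_in, t_out) <= 0:
        return 0.0
    # Maximum likelihood estimates of the mean survival time are t / r.
    if low_only and t_in / r_in >= t_out / r_out:
        return 0.0
    return (r_in * math.log(r_in / t_in)
            + r_out * math.log(r_out / t_out)
            - r_total * math.log(r_total / t_total))


# Hypothetical window: mean survival 5 inside versus 10 outside.
print(exponential_llr(r_in=10, t_in=50.0, r_total=50, t_total=450.0))
```

Because the LLR grows with the number of observations in the window, a larger cluster can have a larger LLR even when its mean survival time is less extreme, which is the pattern seen when comparing the default cluster with cluster 1 at the Gini-selected MRCS.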


Discussion and conclusion

We have proposed a Gini coefficient for the exponential-based spatial scan statistic to optimize the MRCS in cluster detection analysis for survival data. The measure quantifies the degree of heterogeneity in the mean survival times of clusters. Our simulation study showed that the Gini coefficient mostly selected an MRCS close to the true cluster size. Detection accuracy was higher at the chosen MRCS than at the default setting. The lower PPV at the default setting indicates that using the default value of 50% of the total population for both the MSWS and the MRCS tends to produce a larger cluster that hides smaller clusters and includes non-informative areas. Even though the Gini coefficient did not always select an MRCS equal to the true cluster size, overall detection accuracy was generally better with the Gini coefficient than without it, and in some cases the improvement was substantial.

The application of the proposed method to the KCHS data supported that it can optimize the MRCS in spatial cluster detection analysis for survival data. We searched for clusters with a short survival time. The most likely cluster at the default setting was larger and had a higher mean survival time than the most likely cluster at the optimal MRCS chosen by the Gini coefficient. Interestingly, the two clusters at the optimal MRCS were contiguous and formed an irregularly shaped cluster. As Kim and Jung [12] reported for the Poisson model, the Gini coefficient might also be useful for detecting irregularly shaped clusters in the exponential model.

Here, we again emphasize that the Gini coefficient is used to optimize the MRCS, not the MSWS. Rerunning the analysis with different MSWSs should be avoided because of the multiple testing problem. Wang et al. [22] proposed a measure, called the maximum clustering heterogeneous set proportion, as an indicator for selecting the MSWS. As they described, different MSWSs lead to different sets of windows and hence to different detected clusters. Thus, even the same cluster can have different p-values under different sets of windows. Choosing the result with the smallest p-value is incorrect because that p-value is not appropriately adjusted for multiple testing. The appropriate approach is to keep the MSWS fixed and try different values of the MRCS to select the clusters to report.

The Gini coefficient was first developed for the Poisson and Bernoulli models and subsequently adopted for the ordinal and normal-based models. The Gini coefficient for the exponential model in this study was likewise defined specifically for its probability model and thoroughly evaluated. In SaTScan™, the option to optimize the MRCS using the Gini coefficient is available only for the Poisson and Bernoulli models. For the exponential model, the Gini coefficient can easily be implemented in R with the ‘rsatscan’ package [23]. An R function to calculate the Gini coefficient is available upon request.

Using the spatial scan statistic with the default setting has been criticized because the detected most likely cluster may be much larger than the true cluster, as it might include irrelevant neighbouring areas [24-27]. Studies that proposed the Gini coefficient for the Poisson, Bernoulli, ordinal, and normal models showed that using it in spatial scan statistics can resolve this problem to a certain extent [11, 13, 14]. Using the Gini coefficient for the Poisson model can also be effective in detecting irregularly shaped clusters [12]. The exponential model can be used for spatial cluster detection analysis of time-to-event data such as cancer survival, time to disease recurrence, or age at first smoking, with or without censoring. We believe that using the Gini coefficient in the exponential-based spatial scan statistic can be very helpful for reporting more refined and informative clusters for survival data.
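The selection procedure discussed above (one scan with a fixed MSWS, followed by a comparison of candidate MRCS values through their Gini coefficients) can be sketched as follows. This is an illustration under stated assumptions, not the authors' R function. It assumes that the Lorenz curve plots the cumulative share of total survival time against the cumulative share of non-censored observations, that only significant, non-overlapping clusters plus the remaining area enter the curve, and that, as in the Poisson-based approach of Han et al. [11], the MRCS with the largest Gini coefficient is preferred. All function names and numbers are hypothetical.

```python
def gini_coefficient(clusters, total_events, total_time):
    """Gini coefficient from a Lorenz curve over reported clusters.

    clusters: list of (non_censored, total_survival_time) pairs, one per
              significant, non-overlapping cluster (assumed inputs).
    The area outside all clusters is added as one extra group.
    """
    groups = list(clusters)
    rest_events = total_events - sum(e for e, _ in groups)
    rest_time = total_time - sum(t for _, t in groups)
    if rest_events > 0:
        groups.append((rest_events, rest_time))
    # Order groups by mean survival time so the curve lies below the diagonal.
    groups.sort(key=lambda g: g[1] / g[0])

    x = y = area = 0.0
    for events, time in groups:
        x_next = x + events / total_events
        y_next = y + time / total_time
        area += (x_next - x) * (y + y_next) / 2.0  # trapezoid under the curve
        x, y = x_next, y_next
    return 1.0 - 2.0 * area


def select_mrcs(results, total_events, total_time):
    """Return the MRCS with the largest Gini coefficient and all coefficients.

    results: dict mapping a candidate MRCS (in %) to its list of clusters.
    """
    ginis = {mrcs: gini_coefficient(c, total_events, total_time)
             for mrcs, c in results.items()}
    return max(ginis, key=ginis.get), ginis


# Hypothetical output of one scan, filtered at two candidate MRCS values.
results = {
    50: [(50, 400.0)],               # one large cluster, mean survival 8
    25: [(20, 100.0), (20, 140.0)],  # two smaller clusters, means 5 and 7
}
best, ginis = select_mrcs(results, total_events=100, total_time=1000.0)
print(best, ginis)
```

In this toy example the two refined clusters separate short and long survival more sharply than the single large cluster, so the 25% setting receives the larger coefficient.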
Additional file 1:
Table A1. Simulation results for cluster model A (one circular cluster, 10% of total area) with a mean survival time of 2.
Table A2. Simulation results for cluster model A (one circular cluster, 10% of total area) with a mean survival time of 7.
Table A3. Simulation results for cluster model B (one circular cluster, 30% of total area) with a mean survival time of 2.
Table A4. Simulation results for cluster model B (one circular cluster, 30% of total area) with a mean survival time of 7.
Table A5. Simulation results for cluster model C (two circular clusters, 10% each of total area) with a mean survival time of 2.
Table A6. Simulation results for cluster model C (two circular clusters, 10% each of total area) with a mean survival time of 5.
Table A7. Simulation results for cluster model D (one elliptic cluster, 10% of total area) with a mean survival time of 2.
Table A8. Simulation results for cluster model D (one elliptic cluster, 10% of total area) with a mean survival time of 7.
Table A9. Simulation results for cluster model E (one elliptic cluster, 30% of total area) with a mean survival time of 2.
Table A10. Simulation results for cluster model E (one elliptic cluster, 30% of total area) with a mean survival time of 7.
  18 in total

1.  A test for spatial disease clustering adjusted for multiple testing.

Authors:  T Tango
Journal:  Stat Med       Date:  2000-01-30       Impact factor: 2.373

2.  A spatial scan statistic for ordinal data.

Authors:  Inkyung Jung; Martin Kulldorff; Ann C Klassen
Journal:  Stat Med       Date:  2007-03-30       Impact factor: 2.373

3.  Spatial scan statistics can be dangerous.

Authors:  Toshiro Tango
Journal:  Stat Methods Med Res       Date:  2021-01       Impact factor: 3.021

4.  Spatial cluster detection for ordinal outcome data.

Authors:  Inkyung Jung; Hana Lee
Journal:  Stat Med       Date:  2012-07-17       Impact factor: 2.373

5.  Korea Community Health Survey Data Profiles.

Authors:  Yang Wha Kang; Yun Sil Ko; Yoo Jin Kim; Kyoung Mi Sung; Hyo Jin Kim; Hyung Yun Choi; Changhyun Sung; Eunkyeong Jeong
Journal:  Osong Public Health Res Perspect       Date:  2015-06-10

6.  Using Gini coefficient to determining optimal cluster reporting sizes for spatial scan statistics.

Authors:  Junhee Han; Li Zhu; Martin Kulldorff; Scott Hostovich; David G Stinchcomb; Zaria Tatalovich; Denise Riedel Lewis; Eric J Feuer
Journal:  Int J Health Geogr       Date:  2016-08-03       Impact factor: 3.918

7.  A nonparametric spatial scan statistic for continuous data.

Authors:  Inkyung Jung; Ho Jin Cho
Journal:  Int J Health Geogr       Date:  2015-10-20       Impact factor: 3.918

8.  Evaluation of the Gini Coefficient in Spatial Scan Statistics for Detecting Irregularly Shaped Clusters.

Authors:  Jiyu Kim; Inkyung Jung
Journal:  PLoS One       Date:  2017-01-27       Impact factor: 3.240

9.  Optimizing the maximum reported cluster size in the spatial scan statistic for ordinal data.

Authors:  Sehwi Kim; Inkyung Jung
Journal:  PLoS One       Date:  2017-07-28       Impact factor: 3.240

10.  Geographic disparities in colorectal cancer survival.

Authors:  Kevin A Henry; Xiaoling Niu; Francis P Boscoe
Journal:  Int J Health Geogr       Date:  2009-07-23       Impact factor: 3.918
