Literature DB >> 31419252

Spatial scan statistics for matched case-control data.

Abstract

Spatial scan statistics are widely used for cluster detection analysis in geographical disease surveillance. While this method has been developed for various types of data such as binary, count, and continuous data, spatial scan statistics for matched case-control data, which often arise in spatial epidemiology, have not been considered. We propose spatial scan statistics for matched case-control data. The proposed test statistics consider the correlations between matched pairs. We evaluate the statistical power and cluster detection accuracy of the proposed methods through simulations compared to the Bernoulli-based method. We illustrate the proposed methods using a real data example. The simulation study clearly revealed that the proposed methods had higher power and higher accuracy for detecting spatial clusters for matched case-control data than the Bernoulli-based spatial scan statistic. The cluster detection result of the real data example also appeared to reflect a higher power of the proposed methods. The proposed methods are very useful for spatial cluster detection for matched case-control data.

Entities: Chemical Disease Species

Mesh：

Year: 2019 PMID： 31419252 PMCID： PMC6697355 DOI： 10.1371/journal.pone.0221225

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Spatial cluster detection is an important problem in spatial epidemiology. Among the various statistical methods available, the spatial scan statistic [1] is one of the most widely used methods. Application of this method is not limited to geographical disease surveillance, but to various areas, including criminology [2,3], entomology [4], and urban planning [5,6]. The spatial scan statistic is defined as the maximum of likelihood ratio test statistics over a collection of scanning windows. Numerous scanning windows are constructed on an entire study region and each is a candidate for the most likely cluster. The likelihood ratio test statistic for comparing the inside versus the outside of a window is formulated based on the data type to be analyzed. Different probability models for the spatial scan statistic have been proposed and extensively used such as Poisson [1], Bernoulli [1], ordinal [7], multinomial [8], normal [9], and exponential [10]. The freely available software SaTScan [11] can be used for the probability models mentioned above. In epidemiology, one frequently used retrospective observational study design is a case-control study, in which cases with an outcome of interest are identified and a comparable control group is sampled. Further, controls matching each case can be selected to control for confounding variables. In some studies, cases and controls could be matched by their location as well. When we are interested in spatial variation between cases and controls, however, we should not consider their location information as a confounding variable. Matched case-control data require a specific form of analysis to consider dependency in responses within a matched pair. However, there is no spatial scan statistic for matched case-control data. Some studies have used the Bernoulli-based spatial scan statistic [12, 13]. Because the method was developed for independent binary outcome data, it may be inappropriate to apply the Bernoulli-based method to matched case-control data. In this paper, we propose two spatial scan statistics for matched case-control data. The test statistics are constructed based on McNemar’s test statistic and Wald type test statistic for an odds ratio. In the next section, we briefly review the Bernoulli-based spatial scan statistic and then present the proposed methods. Through a simulation study, we evaluated the performance of the proposed methods compared to that of the Bernoulli-based method in terms of statistical power and detection accuracy. We illustrate the proposed methods using a real data example of male lung cancer cases with matched controls in Seoul, Korea. We provide conclusions and discussion in the final section.

Methods

Spatial scan statistic for binary data

For binary outcome data such as cases and non-cases of certain diseases, we can use the Bernoulli-based spatial scan statistic. The null and alternative hypotheses are written as where p and q are the probability of being a case inside and outside the scanning window z, respectively, and Z denotes the collection of all scanning windows. Scanning windows are constructed at every location with varying sizes by including the nearest neighbor one by one, up to certain limit. Usually 50% of total population is set as the maximum value of scanning window size. Given window z, the test statistic is expressed as where c and n denote the number of cases and observations within z, respectively, and C and N are the total number of cases and observations over the whole study area, respectively. I() is an indicator function to indicate the high or low rate. To search for a cluster with a low rate (H: p The scanning window associated with the maximum value of LR(z) is defined as the most likely cluster. The Monte Carlo hypothesis testing is the standard method for obtaining a p-value for the most likely cluster. In addition to the most likely cluster, we often report secondary clusters with high values of LR(z). The p-values of secondary clusters are typically obtained in the same manner. The Bernoulli-based spatial scan statistic is available on SaTScan.

Spatial scan statistics for matched case-control data

When we have binary outcome data from a matched case-control study, it may not be appropriate to use the Bernoulli-based spatial scan statistic described above. The Bernoulli-based scan statistic is used for independent observations. The case and control within a matched pair are not independent. In addition, the hypotheses should be expressed in a different manner from (1) because the probability of being a case is not meaningful in matched case-control data. Suppose that we have n matched pairs, each of which was formed from one case and one control. Given window z, there are four possible statuses for each case-control pair with respect to whether they belong to z or not, as shown in Table 1. Both the case and control can belong to z, only one can, or neither could. Table 1 shows the probability (data) structure for the four possible states. For example, π11(n11) denote the probability (number of pairs) of belonging to window z for both the case and control. Then, we can express the hypotheses to search for clusters with high rates as follows. The situations satisfying the null hypotheses in (2) and (3) are referred to as marginal homogeneity and symmetry, respectively. Equivalently, we may write the hypotheses in (3) using an odds ratio (OR = π10/π01) as Here, we propose utilizing McNemar’s test statistic and Wald-type test statistic for the OR to define spatial scan statistics for matched case-control data. We define the first test statistic given z as and the second test statistic given z as The area with the maximum value of or over z∈Z becomes the most likely cluster. is simply the McNemar’s test statistic. is the squared Wald test statistic for log OR. log(n10/n01) is the conditional maximum likelihood (ML) estimate of log OR (log(π10/π01)), and is its standard error estimate. To search for clusters with low rates, we use the indicator function with the reversed inequality sign.

Table 1

Probability (data) structure for the matched case-control data with respect to belonging to window z (in) or not (out).

For a given z		Control
		in	out
Case	in	π₁₁ (n₁₁)	π₁₀ (n₁₀)	π₁₊ (n₁₊)
Case	out	π₀₁ (n₀₁)	π₀₀ (n₀₀)	π₀₊ (n₀₊)
		π₊₁ (n₊₁)	π₊₀ (n₊₀)	1 (n)

Although and are known to follow a chi-square distribution asymptotically, we do not know the null distributions of or . As for the standard spatial scan statistics, we use Monte Carlo hypothesis testing procedure for the statistical inference of the clusters detected using the proposed methods. Under the null hypothesis, we generate a large number of data sets by randomly permuting the locations of observations with matching ids fixed. Then, we calculate the maximum values of test statistics for each data set. In that way, we obtain empirical null distributions of the proposed test statistics. The Monte Carlo-based p-value for the detected cluster is the rank of the maximum value of the test statistics from the real data set among all data sets divided by the number of all data sets.

Results

Simulation study

We conducted a simulation study to evaluate the performance of the proposed methods. We used the area of Seoul, the capital city of South Korea, as the whole study region. Seoul is composed of 25 districts. We created a true cluster consisting of 5 districts in the northwest area as shown in Fig 1. The cluster include “Jongno-gu” district with 4 nearest neighbors. We set the total number of matched pairs to 100, 200, and 400. When searching for clusters, we set the maximum scanning window size to 50% of total number of matched pairs. We considered 5 different scenarios for the probabilities of π as shown in Table 2. The first scenario was included to evaluate whether the proposed methods adequately control the type I error rate. Across the 4 scenarios except for the first one, the odds ratios (ORs) (π10/π01) are different, while the unconditional marginal ORs (π1+π+0/ π0+π+1) are the same. The unconditional marginal OR refers to the OR as if we deal with the data from an unmatched case-control study. The 4 scenarios provide different information on the magnitude of risk for the cluster. However, the unconditional marginal OR cannot account for this.

Fig 1

A true cluster created for the simulation study.

Table 2

Five different scenarios assumed for the probability structure in the simulation study.

(π₁₁, π₁₀, π₀₁, π₀₀)	π₁₀/π₀₁	π₁₊π₊₀/ π₀₊π₊₁
(0.25, 0.25, 0.25, 0.25)	1	1
(0.05, 0.25, 0.15, 0.55)	1.67	1.71
(0.10, 0.20, 0.10, 0.60)	2	1.71
(0.15, 0.15, 0.05, 0.65)	3	1.71
(0.18, 0.12, 0.02, 0.68)	6	1.71

At each scenario, (n11,n10,n01,n00) were first generated from a multinomial distribution with index n and parameter (π11,π10,π01,π00). We set the same identification number to each pair of case and control to indicate matching from 1 to n. Then, each of cases and controls of the first n11 pairs were randomly assigned to one of 5 districts of the true cluster. Cases and controls of n10 pairs were randomly assigned to districts inside and outside the true cluster, respectively. Similarly, cases and controls of n01 pairs were randomly assigned to districts outside and inside the true cluster, respectively. Both the cases and controls of n00 pairs were randomly assigned to districts outside the true cluster. We compared the performance of the two proposed methods to that of the Bernoulli-based method. We estimated the power, sensitivity, and positive predicted value (PPV) from 1000 replications. Power was estimated as the number of rejected data sets out of 1000. For the first scenario (OR = 1), power is the type I error rate. Because power cannot show the accuracy of detected clusters, sensitivity and PPV were used to evaluate how accurately the methods can detect clusters. Sensitivity was defined as the proportion of districts detected correctly among the districts in the true cluster and PPV as the proportion of districts detected correctly among the districts in the detected cluster. These are commonly used when reporting simulation results in studies on spatial scan statistics [7–10, 14–17]. Larger values of sensitivity and PPV indicate higher accuracy of detected clusters. Sensitivity and PPV were estimated as the average among the rejected samples. Both the proposed methods and the Bernoulli-based method controlled the type I error rate less than the nominal level of 0.05. Estimated type I error rates for T(1) and T(2) were 0.030, 0.036, and 0.042, and 0.028, 0.034, and 0.039 for the number of matched pairs = 100, 200, and 400, respectively. The Bernoulli model had the type I error rates of 0.034, 0.026, and 0.041. Tables 3–5 show the simulation results listing the estimated power, sensitivity, and PPV with the number of matched pairs = 100, 200, and 400, respectively. Overall, the two proposed methods showed higher power, sensitivity, and PPV than the Bernoulli-based method. Moreover, the power of the proposed methods increased as the OR increased. However, the Bernoulli-based method showed similar performances across different scenarios because we assumed the same marginal OR.

Table 3

Estimated power, sensitivity, and PPV with the number of matched pairs = 100 (highest value across three methods is shown in bold).

(π₁₁, π₁₀, π₀₁, π₀₀)		T⁽¹⁾	T⁽²⁾	Bernoulli-based
(0.05, 0.25, 0.15, 0.55)	Power	0.153	0.156	0.113
	Sensitivity	0.799	0.842	0.720
	PPV	0.623	0.603	0.683
(0.10, 0.20, 0.10, 0.60)	Power	0.196	0.190	0.110
	Sensitivity	0.820	0.854	0.691
	PPV	0.689	0.633	0.676
(0.15, 0.15, 0.05, 0.65)	Power	0.249	0.187	0.085
	Sensitivity	0.865	0.882	0.664
	PPV	0.760	0.605	0.596
(0.18, 0.12, 0.02, 0.68)	Power	0.349	0.207	0.078
	Sensitivity	0.908	0.886	0.562
	PPV	0.793	0.567	0.586

Table 5

Estimated power, sensitivity, and PPV with the number of matched pairs = 400 (highest value across three methods is shown in bold).

(π₁₁, π₁₀, π₀₁, π₀₀)		T⁽¹⁾	T⁽²⁾	Bernoulli-based
(0.05, 0.25, 0.15, 0.55)	Power	0.585	0.598	0.586
	Sensitivity	0.886	0.897	0.882
	PPV	0.849	0.811	0.849
(0.10, 0.20, 0.10, 0.60)	Power	0.737	0.743	0.648
	Sensitivity	0.923	0.928	0.878
	PPV	0.859	0.848	0.821
(0.15, 0.15, 0.05, 0.65)	Power	0.920	0.914	0.689
	Sensitivity	0.966	0.969	0.865
	PPV	0.944	0.927	0.837
(0.18, 0.12, 0.02, 0.68)	Power	0.996	0.995	0.684
	Sensitivity	0.993	0.983	0.842
	PPV	0.981	0.943	0.820

We observed some consistent patterns in the results using the two proposed methods. T(1) showed higher power than T(2) except for the case of π10/π01 = 1.67. PPV was always higher in results from T(1) than from T(2). T(2) showed higher sensitivity than T(1) except for π10/π01 = 6. Although the difference in performance between the two proposed methods was marginal, T(1) appeared to perform slightly better under the scenarios we assumed.

Real data example

We used national health insurance service national sample cohort data, which is a randomly selected sample from population-based cohort data set containing insurance eligibility, medical treatments, medical care institutions, and general health examinations [18]. The data comprise approximately 2.2% of total eligible Korean population. Residential area information was available at the district level. From the sample cohort data, we identified 173 male cases diagnosed with lung cancer at general health examination for the year of 2013 in Seoul. Next, we randomly selected 173 controls with age-group matched to each case. We applied the two proposed methods and the Bernoulli-based method to the data. The Bernoulli-based method revealed no significant clusters, while the two proposed methods detected a significant cluster consisting of 11 districts in the midwest area of Seoul, with a p-value equal to 0.047 and 0.044, respectively (Fig 2). The observed counts for the matched pairs with respect to whether they were inside or outside the cluster were n11 = 19, n10 = 53, n01 = 25, and n00 = 76. The conditional ML estimate for the OR was 2.12. The cluster is the region with higher odds of male lung cancer than the remaining region, adjusting for age-group. In fact, the same area was the most likely cluster when the Bernoulli-based method was used, but the statistical significance was not obtained (p-value = 0.09). As shown in the simulation study, the result may reflect that the proposed methods have higher power than the Bernoulli-based spatial scan statistic for matched case-control data.

Fig 2

The most likely cluster detected by the two proposed methods.

Discussion and conclusion

We have proposed two spatial scan statistics for matched case-control data in this paper. The methods are based on McNemar’s test statistic and Wald-type test statistic for the OR. Therefore, we accounted for the correlation in responses within a matched pair. If we use the Bernoulli-based spatial scan statistic for matched case-control data, we ignore the correlation, and the cluster detection test results will suffer from low power. The simulation study clearly revealed that the proposed methods had higher power and higher accuracy for detecting spatial clusters for matched case-control data than the Bernoulli-based spatial scan statistic. The cluster detection result for the male lung cancer data also appeared to reflect a higher power of the proposed methods. The method based on McNemar’s test statistic appeared to perform slightly better than the other proposed method, although the difference was marginal. We have considered searching for clusters with high rates expressed with OR>1. We might be interested in clusters with low rates, i.e., OR<1. Because an OR is symmetric about 1, the reciprocal of an OR represents the same strength of association in opposite direction. Searching for clusters with OR<1 for cases relative to controls is identical to searching clusters with OR>1 for controls relative to cases. We can see this symmetry in the test statistics as well. Both the test statistic values of and remain the same even if n10 and n01 are switched except the identity function part. We can use the proposed methods to search for clusters with low rates by only switching the direction of inequality sign in the identity function. Here, we focused on the simplest situation of 1:1 matching, where one control was matched to each case. When multiple controls were matched to a single case (M:1 matching), we can still use the Wald-type test statistic for the regression coefficient from conditional logistic regression modeling. We may rewrite the hypotheses in (4) and test statistic as H0:β = 0 for all z∈Z vs. H:β>1 for some z∈Z in a conditional logistic regression model as follows. where x = 1 if subject t (t = 1 if case and t = 2 if control) in matched pair i belongs to z and x = 0 otherwise, and For 1:1 matched data, the OR (π10/π01) is identical to exp(β). The conditional logistic model (5) can be extended to M:1 matched data. We only need to modify the model using t = 1,…,M+1 to indicate one case and M controls. We can estimate β and the estimate’s standard error based on the conditional ML method. Then, the Wald-type test statistic can be constructed in the same manner. However, further evaluation in a simulation study is warranted to evaluate the method. We only considered circular windows. Other shapes of scanning windows such as elliptic or irregular windows have been extensively studied [17, 19–24]. It would be interesting to evaluate the proposed methods using other shapes of windows. In addition to the proposed test statistics in this paper, it may be possible to use other types of test statistics for matched case-control data. In conclusion, the proposed methods are very useful for spatial cluster detection for matched case-control data.

A zip file including the lung cancer data set and a sample R code.

(ZIP) Click here for additional data file.

Table 4

Estimated power, sensitivity, and PPV with the number of matched pairs = 200 (highest value across three methods is shown in bold).

(π₁₁, π₁₀, π₀₁, π₀₀)		T⁽¹⁾	T⁽²⁾	Bernoulli-based
(0.05, 0.25, 0.15, 0.55)	Power	0.289	0.303	0.184
	Sensitivity	0.845	0.863	0.837
	PPV	0.732	0.715	0.781
(0.10, 0.20, 0.10, 0.60)	Power	0.375	0.372	0.232
	Sensitivity	0.864	0.888	0.799
	PPV	0.764	0.744	0.734
(0.15, 0.15, 0.05, 0.65)	Power	0.563	0.548	0.210
	Sensitivity	0.923	0.930	0.766
	PPV	0.865	0.817	0.684
(0.18, 0.12, 0.02, 0.68)	Power	0.811	0.678	0.158
	Sensitivity	0.958	0.925	0.747
	PPV	0.921	0.803	0.683

13 in total

1. A spatial scan statistic for ordinal data.

Authors: Inkyung Jung; Martin Kulldorff; Ann C Klassen
Journal: Stat Med Date: 2007-03-30 Impact factor: 2.373

2. An elliptic spatial scan statistic.

Authors: Martin Kulldorff; Lan Huang; Linda Pickle; Luiz Duczmal
Journal: Stat Med Date: 2006-11-30 Impact factor: 2.373

3. Fast detection of arbitrarily shaped disease clusters.

Authors: R Assunção; M Costa; A Tavares; S Ferreira
Journal: Stat Med Date: 2006-03-15 Impact factor: 2.373

4. A spatial scan statistic for survival data.

Authors: Lan Huang; Martin Kulldorff; David Gregorio
Journal: Biometrics Date: 2007-03 Impact factor: 2.571

5. A spatial scan statistic for survival data based on Weibull distribution.

Authors: Vijaya Bhatt; Neeraj Tiwari
Journal: Stat Med Date: 2013-12-18 Impact factor: 2.373

6. A spatial scan statistic for multinomial data.

Authors: Inkyung Jung; Martin Kulldorff; Otukei John Richard
Journal: Stat Med Date: 2010-08-15 Impact factor: 2.373

7. A flexibly shaped spatial scan statistic for detecting clusters.

Authors: Toshiro Tango; Kunihiko Takahashi
Journal: Int J Health Geogr Date: 2005-05-18 Impact factor: 3.918

8. A scan statistic for continuous data based on the normal probability model.

Authors: Martin Kulldorff; Lan Huang; Kevin Konty
Journal: Int J Health Geogr Date: 2009-10-20 Impact factor: 3.918

9. Space-time clustering of non-hodgkin lymphoma using residential histories in a Danish case-control study.

Authors: Rikke Baastrup Nordsborg; Jaymie R Meliker; Annette Kjær Ersbøll; Geoffrey M Jacquez; Ole Raaschou-Nielsen
Journal: PLoS One Date: 2013-04-01 Impact factor: 3.240

10. Sarcoma risk and dioxin emissions from incinerators and industrial plants: a population-based case-control study (Italy).

Authors: Paola Zambon; Paolo Ricci; Emanuela Bovo; Alessandro Casula; Massimo Gattolin; Anna Rita Fiore; Francesco Chiosi; Stefano Guzzinati
Journal: Environ Health Date: 2007-07-16 Impact factor: 5.984