Guogen Shan, Shawn Gerstenberger.
Abstract
This research is motivated by one of our survey studies to assess the potential influence of introducing zebra mussels to the Lake Mead National Recreation Area, Nevada. One research question in this study is the association between boating activity type and awareness of zebra mussels. A chi-squared test is often used for testing independence between two factors with nominal levels. When the null hypothesis of independence between two factors is rejected, we are often left wondering where the significance comes from. Cell residuals, including standardized residuals and adjusted residuals, are traditionally used in testing for cell significance, which is often known as a post hoc test after a statistically significant chi-squared test. In practice, the limiting distributions of these residuals are utilized for statistical inference. However, they may lead to different conclusions based on the calculated p-values, and their p-values could be over- or under-estimated due to the unsatisfactory performance of asymptotic approaches with regard to type I error control. In this article, we propose new exact p-values by using Fisher's approach based on three commonly used test statistics to order the sample space. We theoretically prove that the proposed new exact p-values based on these test statistics are the same. Based on our extensive simulation studies, we show that the existing asymptotic approach based on the adjusted residual is often more likely to reject the null hypothesis than the exact approach, due to the inflated family-wise error rates observed. We recommend the proposed exact p-value for use in practice as a valuable post hoc analysis technique for chi-squared analysis.
Year: 2017 PMID: 29261690 PMCID: PMC5737889 DOI: 10.1371/journal.pone.0188709
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Awareness of zebra mussels of boaters from Lake Mead National Recreation Area, Nevada, USA.
| Awareness \ Boater activity type | Pleasure | Angler | Jet Ski | Other | Total |
|---|---|---|---|---|---|
| Yes | 139 | 15 | 5 | 4 | 163 |
| No | 68 | 15 | 17 | 11 | 111 |
| Total | 207 | 30 | 22 | 15 | 274 |
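As an illustration of the cell residuals mentioned in the abstract, the following minimal sketch computes, for every cell of the awareness table, the squared raw deviation from the expected count, the squared standardized residual, and the squared adjusted residual. The function names `cell_statistics` and `chi2_sf_1df` are hypothetical helpers introduced here, and the formulas are the textbook definitions, assumed (not confirmed by this excerpt) to match the statistics the authors use.

```python
import math

# Awareness table: rows are Yes / No, columns are Pleasure, Angler, Jet Ski, Other.
table = [
    [139, 15, 5, 4],
    [68, 15, 17, 11],
]


def cell_statistics(table):
    """Return {(i, j): (raw_sq, std_sq, adj_sq)} for every cell.

    raw_sq = (observed - expected)^2
    std_sq = raw_sq / expected                      (squared standardized residual)
    adj_sq = std_sq / ((1 - row share)(1 - column share))  (squared adjusted residual)
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    out = {}
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            raw_sq = (observed - expected) ** 2
            std_sq = raw_sq / expected
            adj_sq = std_sq / ((1 - row_tot[i] / n) * (1 - col_tot[j] / n))
            out[(i, j)] = (raw_sq, std_sq, adj_sq)
    return out


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


stats = cell_statistics(table)
# The overall Pearson chi-squared statistic is the sum of the squared standardized residuals.
chi2 = sum(std_sq for _, std_sq, _ in stats.values())

for (i, j), (raw_sq, std_sq, adj_sq) in sorted(stats.items()):
    print(i, j, round(raw_sq, 2), round(std_sq, 2), round(adj_sq, 2),
          round(chi2_sf_1df(std_sq), 3), round(chi2_sf_1df(adj_sq), 5))
```

The asymptotic p-values here refer each squared residual to a chi-squared distribution with one degree of freedom, which is the limiting-distribution approach the abstract describes as potentially unreliable.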
Reorganized data for testing the independence from the ij-th cell.
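| | Column j | Other columns combined | Total |
|---|---|---|---|
| Row i | n_ij | n_i+ − n_ij | n_i+ |
| Other rows combined | n_+j − n_ij | N − n_i+ − n_+j + n_ij | N − n_i+ |
| Total | n_+j | N − n_+j | N |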
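The reorganization around a single cell, followed by an exact test, can be sketched as follows. This is one reading of the approach described in the abstract, not the authors' code: the r × c table is collapsed to a 2 × 2 table around cell (i, j), the margins are held fixed so the cell count follows a hypergeometric distribution, and the p-value sums the probabilities of all tables whose squared deviation from the expected count is at least as large as the observed one. The function names are hypothetical. With fixed margins, the raw, standardized, and adjusted squared residuals are all increasing functions of the same deviation, which is consistent with the abstract's claim that the three orderings give the same exact p-value.

```python
from math import comb

# Awareness table: rows are Yes / No, columns are Pleasure, Angler, Jet Ski, Other.
zebra = [
    [139, 15, 5, 4],
    [68, 15, 17, 11],
]


def collapse(table, i, j):
    """Collapse an r x c table around cell (i, j).

    Returns (cell count, row-i total, column-j total, grand total), which
    fully determines the 2 x 2 table "row i vs. rest" by "column j vs. rest".
    """
    row_total = sum(table[i])
    col_total = sum(row[j] for row in table)
    n = sum(sum(row) for row in table)
    return table[i][j], row_total, col_total, n


def exact_cell_p_value(table, i, j):
    """Exact p-value for cell (i, j), ordering tables by squared deviation."""
    a, r, c, n = collapse(table, i, j)
    lowest, highest = max(0, r + c - n), min(r, c)
    denom = comb(n, c)
    # |x - r*c/n| compared in integer arithmetic to avoid floating-point ties.
    observed_dev = abs(a * n - r * c)
    p = 0.0
    for x in range(lowest, highest + 1):
        if abs(x * n - r * c) >= observed_dev:
            # Hypergeometric probability of seeing x in the cell.
            p += comb(r, x) * comb(n - r, c - x) / denom
    return p


# Angler / Yes cell; the results table in this record lists 3.25×10⁻¹ for it.
print(exact_cell_p_value(zebra, 0, 1))
```

Comparing deviations as integers (`x * n - r * c`) rather than as floating-point squared residuals keeps tables that tie with the observed one inside the rejection region, which matters for cells whose exact p-value is 1.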
Data from the malignant melanoma example for testing independence between tumor type and tumor site.
| Tumor type \ Tumor site | Extremities | Head and neck | Trunk | Total |
|---|---|---|---|---|
| Hutchinson's melanotic freckle (H) | 10 | 22 | 2 | 34 |
| Indeterminate (I) | 28 | 11 | 17 | 56 |
| Nodular (N) | 73 | 19 | 33 | 125 |
| Superficial spreading melanoma (S) | 115 | 16 | 54 | 185 |
| Total | 226 | 68 | 106 | 400 |
P-value calculation for each cell of data from the malignant melanoma example.
The calculated p-value for each cell is compared to the multiple comparison correction method by Simes [18]. The cells with significant p-values are bold.
| Site | Type | Freq | (O − E)² | Standardized residual² | P-value | Adjusted residual² | P-value | Exact P-value |
|---|---|---|---|---|---|---|---|---|
| Head neck | H | 22 | 263.09 | 45.52 | | 59.93 | | |
| Head neck | S | 16 | 238.70 | 7.59 | | 17.01 | | |
| Extremities | H | 10 | 84.82 | 4.42 | 3.56×10⁻² | 11.09 | | |
| Trunk | H | 2 | 49.14 | 5.45 | 1.95×10⁻² | 8.11 | | |
| Extremities | S | 115 | 109.73 | 1.05 | 3.06×10⁻¹ | 4.49 | 3.41×10⁻² | 4.29×10⁻² |
| Trunk | S | 54 | 24.75 | 0.50 | 4.77×10⁻¹ | 1.28 | 2.58×10⁻¹ | 3.07×10⁻¹ |
| Extremities | I | 28 | 13.25 | 0.42 | 5.18×10⁻¹ | 1.12 | 2.90×10⁻¹ | 3.11×10⁻¹ |
| Trunk | I | 17 | 4.67 | 0.31 | 5.75×10⁻¹ | 0.50 | 4.81×10⁻¹ | 5.14×10⁻¹ |
| Head neck | N | 19 | 5.06 | 0.24 | 6.25×10⁻¹ | 0.42 | 5.18×10⁻¹ | 5.68×10⁻¹ |
| Extremities | N | 73 | 5.64 | 0.08 | 7.77×10⁻¹ | 0.27 | 6.05×10⁻¹ | 6.64×10⁻¹ |
| Head neck | I | 11 | 2.19 | 0.23 | 6.31×10⁻¹ | 0.32 | 5.70×10⁻¹ | 7.02×10⁻¹ |
| Trunk | N | 33 | 0.02 | 0.00 | 9.83×10⁻¹ | 0.00 | 9.76×10⁻¹ | 1.00 |
P-value calculation for each cell of data from the survey for the awareness of zebra mussels.
The calculated p-value for each cell is compared to the multiple comparison correction method by Simes [18]. The cells with significant p-values are bold.
| Activity | Awareness | Freq | (O − E)² | Standardized residual² | P-value | Adjusted residual² | P-value | Exact P-value |
|---|---|---|---|---|---|---|---|---|
| Pleasure | No | 68 | 251.47 | 3.00 | 0.08 | 20.61 | | |
| Pleasure | Yes | 139 | 251.47 | 2.04 | 0.15 | 20.61 | | |
| Jet Ski | Yes | 5 | 65.41 | 5.00 | 0.03 | 13.41 | | |
| Jet Ski | No | 17 | 65.41 | 7.34 | 0.01 | 13.41 | | |
| Other | Yes | 4 | 24.24 | 2.72 | 0.10 | 7.09 | | |
| Other | No | 11 | 24.24 | 3.99 | 0.05 | 7.09 | | |
| Angler | Yes | 15 | 8.10 | 0.45 | 0.50 | 1.26 | 2.62×10⁻¹ | 3.25×10⁻¹ |
| Angler | No | 15 | 8.10 | 0.67 | 0.41 | 1.26 | 2.62×10⁻¹ | 3.25×10⁻¹ |
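The comparison against the Simes correction can be sketched as below. This is an assumption about how the per-cell comparison is carried out: the p-values are sorted, and the i-th smallest is compared with i·α/m, where m is the number of cells. The function name `simes_flags` is hypothetical, and the input is the column of asymptotic p-values based on the standardized residual from the table above, as printed (rounded to two decimals).

```python
def simes_flags(p_values, alpha=0.05):
    """Flag each p-value that is at most rank * alpha / m.

    rank is the position of the p-value in ascending order (1-based) and
    m is the number of p-values. Assumed reading of the Simes comparison.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    flags = [False] * m
    for rank, k in enumerate(order, start=1):
        flags[k] = p_values[k] <= rank * alpha / m
    return flags


# Standardized-residual p-values in table row order (rounded values as printed).
std_p_values = [0.08, 0.15, 0.03, 0.01, 0.10, 0.05, 0.50, 0.41]
flags = simes_flags(std_p_values)
print(flags)
```

Under this reading, the smallest p-value (0.01) is compared with 0.05/8 = 0.00625 and is not flagged, which agrees with that column containing no bold entries in the table.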
Fig 1. Actual family-wise error rates of the proposed exact approach and the existing asymptotic approach based on the adjusted residual at the nominal level of 0.05.
For a 3 × 3 contingency table, frequency (Freq) and proportion (Prop, in %) of simulated data sets with at least one significant cell based on either T or the exact p-value, from a total of 2 million simulations.
Γ and Γ are the number of cells with significant p-values by using the asymptotic approach and the exact approach, respectively.
| | N = 50: Freq | N = 50: Prop (%) | N = 100: Freq | N = 100: Prop (%) | N = 300: Freq | N = 300: Prop (%) | N = 500: Freq | N = 500: Prop (%) |
|---|---|---|---|---|---|---|---|---|
| Γ | 28975 | 34.69 | 38868 | 45.79 | 52616 | 61.50 | 58046 | 68.15 |
| Γ | 41845 | 50.10 | 32801 | 38.64 | 21705 | 25.37 | 17256 | 20.26 |
| Γ | 12700 | 15.21 | 13133 | 15.47 | 11136 | 13.02 | 9761 | 11.46 |
| Γ | 0 | 0.00 | 74 | 0.09 | 82 | 0.10 | 77 | 0.09 |
| Γ | 2 | 0.00 | 12 | 0.01 | 13 | 0.02 | 30 | 0.04 |
| Total | 83522 | 100 | 84888 | 100 | 85552 | 100 | 85170 | 100 |
For a 3 × 5 contingency table, frequency (Freq) and proportion (Prop, in %) of simulated data sets with at least one significant cell based on either T or the exact p-value, from a total of 2 million simulations.
Γ and Γ are the number of cells with significant p-values by using the asymptotic approach and the exact approach, respectively.
| | N = 50: Freq | N = 50: Prop (%) | N = 100: Freq | N = 100: Prop (%) | N = 300: Freq | N = 300: Prop (%) | N = 500: Freq | N = 500: Prop (%) |
|---|---|---|---|---|---|---|---|---|
| Γ | 27157 | 27.72 | 39202 | 39.40 | 54207 | 55.71 | 60262 | 62.71 |
| Γ | 64211 | 65.53 | 51568 | 51.83 | 34503 | 35.46 | 28220 | 29.37 |
| Γ | 6614 | 6.75 | 8427 | 8.47 | 8269 | 8.50 | 7373 | 7.67 |
| Γ | 0 | 0.00 | 268 | 0.27 | 267 | 0.27 | 196 | 0.20 |
| Γ | 0 | 0.00 | 23 | 0.02 | 58 | 0.06 | 42 | 0.04 |
| Total | 97982 | 100 | 99488 | 100 | 97304 | 100 | 96093 | 100 |
For a 5 × 5 contingency table, frequency (Freq) and proportion (Prop, in %) of simulated data sets with at least one significant cell based on either T or the exact p-value, from a total of 2 million simulations.
Γ and Γ are the number of cells with significant p-values by using the asymptotic approach and the exact approach, respectively.
| | N = 50: Freq | N = 50: Prop (%) | N = 100: Freq | N = 100: Prop (%) | N = 300: Freq | N = 300: Prop (%) | N = 500: Freq | N = 500: Prop (%) |
|---|---|---|---|---|---|---|---|---|
| Γ | 27316 | 17.71 | 37304 | 26.62 | 54315 | 43.55 | 61890 | 52.49 |
| Γ | 123066 | 79.79 | 97514 | 69.58 | 64434 | 51.66 | 50217 | 42.59 |
| Γ | 3853 | 2.50 | 5167 | 3.69 | 5565 | 4.46 | 5445 | 4.62 |
| Γ | 0 | 0.00 | 148 | 0.11 | 329 | 0.26 | 278 | 0.24 |
| Γ | 0 | 0.00 | 10 | 0.01 | 81 | 0.06 | 73 | 0.06 |
| Total | 154235 | 100 | 140143 | 100 | 124724 | 100 | 117903 | 100 |