Evan L Busch.
Abstract
In research, policy, and practice, continuous variables are often categorized. Statisticians have generally advised against categorization for many reasons, such as loss of information and precision as well as distortion of estimated statistics. Here, a different kind of problem with categorization is considered: the idea that, for a given continuous variable, there is a unique set of cut points that is the objectively correct or best categorization. It is shown that this is unlikely to be the case because categorized variables typically exist in webs of statistical relationships with other variables. The choice of cut points for a categorized variable can influence the values of many statistics relating that variable to others. This essay explores the substantive trade-offs that can arise between different possible cut points to categorize a continuous variable, making it difficult to say that any particular categorization is objectively best. Limitations of different approaches to selecting cut points are discussed. Contextual trade-offs may often be an argument against categorization. At the very least, such trade-offs mean that research inferences, or decisions about policy or practice, that involve categorized variables should be framed and acted upon with flexibility and humility.
Keywords: data analysis; statistical data interpretation; statistics; translational medical research; translational medical science
Year: 2021; PMID: 34424538; PMCID: PMC8578203; DOI: 10.1002/cncr.33838
Source DB: PubMed; Journal: Cancer; ISSN: 0008-543X; Impact factor: 6.921
Figure 1. Simple process of health or disease.
Associations of Endometrial Tumor ER Expression With Obesity and Mortality Outcomes
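| ER Cut Point, % | Obesity/ER Association: OR | Obesity/ER Association: 95% CI | Obesity/ER Association: CLR | ER/All‐Cause Mortality Association: HR | ER/All‐Cause Mortality Association: 95% CI | ER/All‐Cause Mortality Association: CLR | ER/Cancer‐Specific Mortality Association: HR | ER/Cancer‐Specific Mortality Association: 95% CI | ER/Cancer‐Specific Mortality Association: CLR |
|---|---|---|---|---|---|---|---|---|---|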
| 0 | 2.83 | 1.26‐6.37 | 5.06 | 0.62 | 0.29‐1.30 | 4.48 | 0.32 | 0.13‐0.83 | 6.38 |
| 10 | 2.92 | 1.34‐6.33 | 4.72 | 0.61 | 0.30‐1.22 | 4.07 | 0.27 | 0.11‐0.65 | 5.91 |
| 20 | 2.40 | 1.22‐4.74 | 3.89 | 0.55 | 0.30‐1.03 | 3.43 | 0.29 | 0.12‐0.69 | 5.75 |
| 30 | 1.54 | 0.86‐2.75 | 3.20 | 0.55 | 0.31‐0.97 | 3.13 | 0.23 | 0.10‐0.51 | 5.10 |
| 40 | 1.35 | 0.78‐2.36 | 3.03 | 0.55 | 0.31‐0.96 | 3.10 | 0.21 | 0.09‐0.48 | 5.33 |
| 50 | 1.10 | 0.65‐1.87 | 2.88 | 0.59 | 0.34‐1.02 | 3.00 | 0.20 | 0.09‐0.47 | 5.22 |
Abbreviations: CI, confidence interval; CLR, confidence limit ratio (upper limit/lower limit); ER, estrogen receptor; HR, hazard ratio; OR, odds ratio.
ER expression was measured as the continuous percentage of positive tumor cells (0%‐100%) and then dichotomized at a given cut point (ER+ vs ER–). ER+ was defined as expression at or above the cut point except for a cut point of 0%, where ER+ was only expression above the cut point. The dichotomous ER status was the dependent variable in obesity‐ER models and an independent variable in ER‐mortality models. The obesity variable was a dichotomization of the body mass index (≥30 vs <30 kg/m²). This table was adapted with permission from Tables 2 and 4 in Busch et al.
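The dichotomization rule and the CLR column described in the table note can be reproduced in a few lines of Python. This is only a minimal sketch: the function names `er_positive` and `clr` are illustrative and not from the source, and the confidence limits are copied from the obesity/ER columns of the table above.

```python
def er_positive(expression_pct: float, cut_point: float) -> bool:
    """Dichotomize ER expression (0-100%) following the table note.

    ER+ is expression at or above the cut point, except at a cut point
    of 0%, where ER+ means expression strictly above 0%.
    """
    if cut_point == 0:
        return expression_pct > 0
    return expression_pct >= cut_point


def clr(lower: float, upper: float) -> float:
    """Confidence limit ratio: upper confidence limit / lower confidence limit."""
    return upper / lower


# Cut point (%) -> (lower, upper) 95% confidence limits of the obesity/ER
# odds ratio, copied from the table above.
OBESITY_ER_CI = {
    0: (1.26, 6.37),
    10: (1.34, 6.33),
    20: (1.22, 4.74),
    30: (0.86, 2.75),
    40: (0.78, 2.36),
    50: (0.65, 1.87),
}

# Recompute the CLR for each cut point, rounded to two decimals as in the table.
clr_by_cut_point = {
    cut: round(clr(lower, upper), 2)
    for cut, (lower, upper) in OBESITY_ER_CI.items()
}

if __name__ == "__main__":
    for cut, value in clr_by_cut_point.items():
        print(f"cut point {cut:>2}%: CLR = {value:.2f}")
```

Read against the table, the recomputed CLR falls from 5.06 at a 0% cut point to 2.88 at 50% (a more precise estimate), while the obesity/ER odds ratio moves from 2.83 toward 1.10. No single cut point is best on both magnitude and precision.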
Prediction of All‐Cause Mortality After the Addition of E‐Cadherin Measurements to Standard Diagnostic Tests of Cancer Cell Detachment From Colorectal Primary Tumors
| E‐Cadherin Variable Added to Standard Tests | Continuous | Dichotomous, Cut Point 0.52 | Dichotomous, Cut Point 0.60 | Dichotomous, Cut Point 0.85 |
|---|---|---|---|---|
| C‐index, % (95% CI) | 66 (58 to 72) | 51 (41 to 59) | 54 (45 to 62) | 56 (48 to 63) |
| Reclassification metric | | | | |
| No. (%) moved to higher risk category | 47 (25) | 11 (6) | 27 (14) | 41 (22) |
| No. (%) moved to lower risk category | 55 (29) | 93 (49) | 83 (44) | 70 (37) |
| Total No. (%) reclassified | 102 (54) | 104 (55) | 110 (59) | 111 (59) |
| Reclassification calibration statistic | .1 | .1 | .1 | .2 |
| Event net reclassification index, % (95% CI) | 14 (–11 to 30) | –22 (–38 to –7) | –7 (–23 to 10) | 3 (–15 to 21) |
| Nonevent net reclassification index, % (95% CI) | 13 (3 to 35) | 54 (44 to 63) | 41 (29 to 52) | 24 (12 to 37) |
| Integrated discrimination improvement, % (95% CI) | 3.4 (1.9 to 5.6) | 4.3 (2.2 to 6.8) | 3.4 (1.8 to 5.3) | 3.7 (1.7 to 5.9) |
Abbreviation: CI, confidence interval.
E‐cadherin was measured on a continuous average intensity scale of 0 to 3 and then modeled as either continuous or dichotomized at a given cut point. Each C‐index value is for a Cox model of all‐cause mortality based on standard diagnostic tests of cancer cell detachment (lymph node evaluation and radiologic imaging) plus the respective E‐cadherin variable. Reclassification metrics compare a Cox model of standard diagnostic tests estimating all‐cause mortality to a Cox model of standard diagnostic tests plus the respective E‐cadherin variable. This table was modified with permission from Table 4 in Busch et al., which was published under a CC BY 4.0 license.
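The reclassification rows of the table can be made concrete with a short Python sketch. It uses the common category-based definitions of the event and nonevent net reclassification index; this is an assumption, because the source does not state its exact formula and a Cox-model analysis may additionally account for censoring. The function name `net_reclassification` and the toy risk categories are hypothetical; only the "moved higher/lower" counts are taken from the table.

```python
from typing import Sequence, Tuple


def net_reclassification(
    old_category: Sequence[int],
    new_category: Sequence[int],
    had_event: Sequence[bool],
) -> Tuple[float, float]:
    """Return (event NRI, nonevent NRI) as proportions.

    Event NRI    = (events moved up - events moved down) / number of events
    Nonevent NRI = (nonevents moved down - nonevents moved up) / number of nonevents
    Censoring is ignored in this simplified version.
    """
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for old, new, event in zip(old_category, new_category, had_event):
        if event:
            n_e += 1
            up_e += new > old
            down_e += new < old
        else:
            n_n += 1
            up_n += new > old
            down_n += new < old
    if n_e == 0 or n_n == 0:
        raise ValueError("need at least one event and one nonevent")
    return (up_e - down_e) / n_e, (down_n - up_n) / n_n


# Hypothetical toy data: risk categories 0 (low) to 2 (high), before and
# after adding a marker to the model.
old = [0, 0, 1, 1, 2, 2, 1, 0]
new = [1, 0, 2, 0, 2, 1, 1, 0]
event = [True, True, True, True, False, False, False, False]
event_nri, nonevent_nri = net_reclassification(old, new, event)

# Counts copied from the table (continuous, then cut points 0.52, 0.60, 0.85):
moved_higher = [47, 11, 27, 41]
moved_lower = [55, 93, 83, 70]
total_reclassified = [h + l for h, l in zip(moved_higher, moved_lower)]

if __name__ == "__main__":
    print(event_nri, nonevent_nri)
    print(total_reclassified)
```

The totals recomputed from the table counts match its "Total No. reclassified" row. In the table, the 0.52 cut point has the highest nonevent index (54%) and the lowest event index (–22%), so which cut point looks best depends on which kind of misclassification matters more in context.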
Quantities Sensitive to Choices of Variable Category Cut‐Point Values
| Number and proportion of units within the category |
| Measures of association (relative and absolute) |
| Model fit statistics (eg, AIC and BIC) |
|
|
| Hypothesis test statistics |
| Correlations |
| Splines |
| Sensitivity |
| Specificity |
| Positive predictive value |
| Negative predictive value |
| C‐statistic (ie, area under ROC curve) |
| C‐index |
| Predicted probability of an outcome |
| Event net reclassification index |
| Nonevent net reclassification index |
| Integrated discrimination improvement |
| Reclassification calibration statistic |
| Number and proportion of units reclassified across outcome risk categories |
Abbreviations: AIC, Akaike information criterion; BIC, Bayesian information criterion; ROC, receiver operating characteristic.
The list in this table is not exhaustive.
Both the magnitude and the precision of the measure are sensitive to cut‐point selection.
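Several quantities listed in the table (sensitivity, specificity, positive and negative predictive value) depend directly on where a continuous marker is dichotomized. The following Python sketch illustrates this on a small, entirely hypothetical data set; the function name `diagnostic_metrics`, the marker values, and the disease labels are assumptions made for illustration and do not come from the article.

```python
from typing import Dict, Sequence


def diagnostic_metrics(
    values: Sequence[float],
    has_disease: Sequence[bool],
    cut_point: float,
) -> Dict[str, float]:
    """Treat value >= cut_point as test positive and return four metrics."""
    tp = fp = tn = fn = 0
    for value, diseased in zip(values, has_disease):
        positive = value >= cut_point
        if positive and diseased:
            tp += 1
        elif positive:
            fp += 1
        elif diseased:
            fn += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one non-diseased unit")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Predictive values are undefined if nobody tests positive/negative.
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
    }


# Hypothetical marker values and disease status for ten units.
VALUES = [5, 12, 18, 25, 33, 41, 47, 58, 64, 79]
HAS_DISEASE = [False, False, True, False, False, True, True, False, True, True]

m20 = diagnostic_metrics(VALUES, HAS_DISEASE, cut_point=20)
m50 = diagnostic_metrics(VALUES, HAS_DISEASE, cut_point=50)

if __name__ == "__main__":
    for cut, metrics in ((20, m20), (50, m50)):
        print(cut, {name: round(x, 2) for name, x in metrics.items()})
```

In this toy example, raising the cut point from 20 to 50 swaps sensitivity and specificity (0.8 and 0.4 become 0.4 and 0.8), which mirrors the kind of trade-off the table is meant to flag.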
Figure 2. Partial context of body mass index: a sampling of upstream and downstream variables.