Matthijs Blankers1, Tom Frijns2, Vendula Belackova3, Carla Rossi4, Bengt Svensson5, Franz Trautmann2, Margriet van Laar2.
Abstract
INTRODUCTION: Cannabis is Europe's most commonly used illicit drug. Some users do not develop dependence or other problems, whereas others do. Many factors are associated with the occurrence of cannabis-related disorders. This makes it difficult to identify key risk factors and markers to profile at-risk cannabis users using traditional hypothesis-driven approaches. Therefore, the use of a data-mining technique called binary recursive partitioning is demonstrated in this study by creating a classification tree to profile at-risk users.
Year: 2014 PMID: 25264894 PMCID: PMC4180744 DOI: 10.1371/journal.pone.0108298
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Demographic statistics of the participants from the four countries.
| Characteristic, M(SD) or n(%) | Combined (n = 2617) | Czech (n = 386) | Italy (n = 800) | Netherlands (n = 839) | Sweden (n = 592) | F(3,2613) or χ2(3) | p |
| Male | 2009 (77%) | 268 (69%) | 630 (79%) | 590 (70%) | 521 (88%) | 74.89 | <0.0001 |
| Age (years) | 25.6 (8.53) | 23.8 (8.30) | 26.0 (7.92) | 25.8 (8.70) | 26.0 (9.08) | 6.67 | 0.0002 |
| Student | 1119 (44%) | 214 (56%) | 428 (55%) | 259 (33%) | 218 (37%) | 108.10 | <0.0001 |
| Residential urbanisation level | | | | | | 244.00 | <0.0001 |
| City | 757 (30%) | 49 (13%) | 266 (34%) | 155 (20%) | 287 (49%) | | |
| Town | 1278 (51%) | 267 (70%) | 404 (52%) | 414 (53%) | 193 (33%) | | |
| Village | 492 (19%) | 64 (17%) | 112 (14%) | 214 (27%) | 102 (18%) | | |
| Living alone | 420 (17%) | 30 (8.0%) | 97 (12%) | 119 (15%) | 174 (30%) | 106.36 | <0.0001 |
| Unemployed | 230 (9.1%) | 20 (5.3%) | 85 (11%) | 71 (9.1%) | 54 (9.3%) | 9.67 | 0.022 |
| Age at first cannabis use | 16.4 (3.33) | 15.5 (2.41) | 16.1 (2.19) | 16.2 (3.91) | 17.5 (3.91) | 36.80 | <0.0001 |
| Age at regular use | 18.3 (4.73) | 17.4 (3.53) | 18.2 (4.27) | 17.8 (4.98) | 20.0 (5.38) | 28.92 | <0.0001 |
| Number of use days (last year) | | | | | | 153.34 | <0.0001 |
| Incidentally (0–10 days) | 691 (27%) | 89 (23%) | 141 (18%) | 297 (35%) | 164 (28%) | | |
| Moderately (11–100 days) | 791 (30%) | 118 (31%) | 199 (25%) | 247 (29%) | 227 (39%) | | |
| Frequently (101–300 days) | 645 (25%) | 100 (26%) | 263 (33%) | 143 (17%) | 139 (24%) | | |
| (Almost) daily (>300 days) | 480 (18%) | 77 (20%) | 194 (24%) | 151 (18%) | 58 (9.9%) | | |
| Number of units per typical day | 2.70 (2.35) | 2.65 (2.14) | 2.98 (2.40) | 2.27 (2.14) | 2.95 (2.59) | 13.85 | <0.0001 |
| Cannabis use on weekdays | 1242 (47%) | 187 (48%) | 519 (65%) | 306 (36%) | 230 (39%) | 155.67 | <0.0001 |
| Cannabis use any time of day | 804 (31%) | 142 (37%) | 298 (37%) | 201 (24%) | 163 (28%) | 43.56 | <0.0001 |
| Amphetamine lifetime use | 857 (33%) | 126 (33%) | 100 (13%) | 456 (55%) | 175 (30%) | 327.50 | <0.0001 |
| Cocaine lifetime use | 850 (33%) | 80 (21%) | 217 (27%) | 424 (51%) | 129 (22%) | 189.29 | <0.0001 |
| Ecstasy lifetime use | 983 (38%) | 129 (34%) | 113 (14%) | 605 (73%) | 136 (23%) | 668.94 | <0.0001 |
| CAST sum score | 5.79 (4.32) | 5.74 (4.22) | 5.88 (3.76) | 5.63 (4.91) | 5.93 (4.21) | 0.70 | 0.553 |
| CAST sum score ≥7 | 1058 (40%) | 151 (39%) | 328 (41%) | 324 (39%) | 255 (43%) | 3.25 | 0.355 |
Note: Percentages are based on available cases (missing values omitted). Regular use is defined as use at least once per month. Number of use days is an aggregated version of the 12-level number-of-use-days predictor used in the analysis. Lifetime use refers to any use (incidence of use). Cannabis use on weekdays refers to the question of whether participants use cannabis at weekends only, or also (or mainly) on working days; presented is the proportion indicating that they use cannabis also or mainly on weekdays. Cannabis use any time of day refers to the question of whether participants use cannabis at a specific time of day or not; presented is the proportion indicating that they use cannabis at any time of day. Sample sizes of the four countries (and combined) are lower than the indicated number for some variables due to item missingness. F tests and chi-squared tests were performed to test the differences between the four countries. Samples are not representative of the general population in any of the four countries.
Ten models with optimal fit.
| Model | Nagelkerke R2 | AIC |
| CAST7∼freq_12mo+quant_typical_day+last_use_alone+coke_LT+sex+age_first_use | 0.420 | 1805 |
| CAST7∼freq_12mo+quant_last_day+last_use_alone+coke_LT+spice_LT+coke_LY | 0.412 | 2034 |
| CAST7∼freq_12mo+use_anytime+last_use_alone+buy_my_own+amph_LT+police_found_12mo | 0.410 | 1960 |
| CAST7∼freq_12mo+quant_typical_day+last_use_alone+coke_LT+spice_LY+ketam_LT | 0.408 | 1833 |
| CAST7∼freq_12mo+use_weekdays+quant_typical_day+last_use_alone+coke_LT | 0.408 | 1833 |
| CAST7∼freq_12mo+quant_last_day+last_use_alone+sex+age_first_use+spice_LY | 0.407 | 2038 |
| CAST7∼freq_12mo+last_use_alone+coke_LT+weed_user+age_first_use+unemployed | 0.407 | 1974 |
| CAST7∼freq_12mo+quant_typical_day+last_use_alone+sex+amph_LY | 0.404 | 1840 |
| CAST7∼freq_12mo+quant_typical_day+last_use_alone+amph_LT+spice_LY | 0.404 | 1841 |
| CAST7∼freq_12mo+quant_typical_day+last_use_alone+police_found_can_12mo+amph_LY | 0.404 | 1841 |
Note: Variables correspond to those described in the section ‘Bivariate analysis’. AIC is the Akaike Information Criterion; Nagelkerke R2 is a goodness-of-fit index ranging from 0 to 1.
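The two fit indices in this table can both be derived from model log-likelihoods. The following is a minimal sketch using the standard textbook definitions; the function names and the example numbers in the test are illustrative assumptions, not values or code from the study.

```python
from math import exp


def aic(log_lik: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 * log-likelihood (lower is better)."""
    return 2 * n_params - 2 * log_lik


def nagelkerke_r2(log_lik_null: float, log_lik_model: float, n: int) -> float:
    """Nagelkerke R2: the Cox-Snell R2 rescaled so that its maximum is 1.

    log_lik_null  -- log-likelihood of the intercept-only model
    log_lik_model -- log-likelihood of the fitted model
    n             -- number of observations
    """
    cox_snell = 1 - exp(2 * (log_lik_null - log_lik_model) / n)
    max_cox_snell = 1 - exp(2 * log_lik_null / n)
    return cox_snell / max_cox_snell
```

A model that does not improve on the intercept-only model scores 0, and a model with a log-likelihood of 0 (perfect prediction) scores 1.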
Figure 1. Recursive partitioning classification tree analysis of probability CAST ≥7 for four countries.
Note: Generic classification model based on training data (n = 2074). The variables in the numbered boxes are the splitting variables identified in the recursive partitioning analysis. The cut-off value for each split and the number of participants involved are indicated next to the arrows leading away from the splitting variable. The six grey areas at the bottom of the lowest boxes (“terminal nodes”), together with the “p” in these boxes, indicate the proportion of participants in each partition with a score of 7 or higher on the CAST. The “n” in the lowest boxes indicates the number of participants in each terminal node. For each of the 5 splits, p<0.001.
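The idea behind such a tree can be sketched as follows: pick the cut-off that best separates the binary outcome, split, and repeat on each side until no split is significant. This is a simplified illustration for a single ordinal predictor, not the study's algorithm. It assumes an uncorrected Pearson chi-square as the split criterion, and the p-value of the best cut-off is not adjusted for having searched over several cut-offs. All names (`best_split`, `grow`, `alpha`, `min_n`) are hypothetical.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def best_split(xs, ys):
    """Return (cutoff, chi2, p) for the cut `x <= cutoff` most associated with y.

    ys holds 0/1 outcomes (1 = CAST >= 7). Returns None if x has a single value.
    """
    best = None
    for cutoff in sorted(set(xs))[:-1]:  # the largest value cannot split the data
        left = [y for x, y in zip(xs, ys) if x <= cutoff]
        right = [y for x, y in zip(xs, ys) if x > cutoff]
        stat = chi2_2x2(sum(left), len(left) - sum(left),
                        sum(right), len(right) - sum(right))
        if best is None or stat > best[1]:
            # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
            best = (cutoff, stat, erfc(sqrt(stat / 2)))
    return best


def grow(xs, ys, alpha=0.001, min_n=20):
    """Recursively partition; terminal nodes report n and the proportion p of y == 1."""
    split = best_split(xs, ys) if len(ys) >= min_n else None
    if split is None or split[2] >= alpha:
        return {"n": len(ys), "p": sum(ys) / len(ys)}
    cutoff = split[0]
    left = [(x, y) for x, y in zip(xs, ys) if x <= cutoff]
    right = [(x, y) for x, y in zip(xs, ys) if x > cutoff]
    return {
        "cutoff": cutoff,
        "left": grow([x for x, _ in left], [y for _, y in left], alpha, min_n),
        "right": grow([x for x, _ in right], [y for _, y in right], alpha, min_n),
    }
```

Each terminal node of the returned dictionary carries the same two quantities as the lowest boxes in the figure: the node size `n` and the proportion `p` of participants at or above the cut-off score.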
Performance statistics of the generic classification tree model.
| Dataset | n | Accuracy (95% CI) | No Information Rate | p | Sensitivity | Specificity | Positive Predictive Value | Negative Predictive Value |
| Training | 2074 | 0.73 (0.71–0.75) | 0.59 | <0.0001 | 0.83 | 0.66 | 0.63 | 0.85 |
| Validation | 543 | 0.69 (0.65–0.73) | 0.60 | <0.0001 | 0.80 | 0.62 | 0.58 | 0.82 |
| Full | 2617 | 0.72 (0.70–0.74) | 0.60 | <0.0001 | 0.83 | 0.65 | 0.62 | 0.85 |
| Czech | 386 | 0.67 (0.62–0.72) | 0.61 | 0.0067 | 0.85 | 0.55 | 0.55 | 0.86 |
| Italy | 800 | 0.66 (0.62–0.69) | 0.59 | <0.0001 | 0.90 | 0.49 | 0.55 | 0.87 |
| Netherlands | 839 | 0.80 (0.77–0.83) | 0.61 | <0.0001 | 0.82 | 0.79 | 0.71 | 0.87 |
| Sweden | 592 | 0.73 (0.70–0.77) | 0.57 | <0.0001 | 0.74 | 0.73 | 0.67 | 0.79 |
Note: Accuracy indicates the proportion of correctly classified cases, with associated confidence interval; No information rate (NIR) is 1-[proportion of CAST≥7] in the sample; Difference between model accuracy and NIR tested using a one-sided binomial test; Sensitivity = [number of correctly classified CAST ≥7]/[number of CAST ≥7 in sample]; Specificity = [number of correctly classified CAST <7]/[number of CAST <7 in sample]; Positive predictive value = [number of correctly classified CAST ≥7]/[all classified CAST ≥7]; Negative predictive value = [number of correctly classified CAST <7]/[all classified CAST <7]. The country datasets contain both participants from the training and the validation dataset.
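The definitions in this note translate directly into arithmetic on a 2×2 confusion matrix. The sketch below follows those definitions; the function names and the counts used in the example are illustrative assumptions, since the study's underlying cell counts are not reported here. The one-sided binomial test is written as an exact upper-tail probability of observing at least the given number of correct classifications if the true accuracy equalled the no information rate.

```python
from math import exp, lgamma, log


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Performance statistics where 'positive' means CAST >= 7."""
    n = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / n,       # proportion correctly classified
        "nir": 1 - (tp + fn) / n,        # 1 - [proportion of CAST >= 7]
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1.

    Terms are computed in log space so that large n does not overflow.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_term = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                    + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_term)
    return min(total, 1.0)


# Illustrative counts only (not taken from the study).
m = classification_metrics(tp=80, fp=40, tn=60, fn=20)
p_value = binomial_upper_tail(80 + 60, 200, m["nir"])
```

With these made-up counts the accuracy is 0.70 against a no information rate of 0.50, and `p_value` is the one-sided probability of reaching that accuracy by chance.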
Figure 2. Country specific classification tree models.
Note: Country specific classification tree models for the Czech Republic (top-left), Italy (top-right), Netherlands (bottom-left) and Sweden (bottom-right). The variables in the numbered boxes are the splitting variables identified in the recursive partitioning analysis. The cut-off value for each split and the number of participants involved are indicated next to the arrows leading away from the splitting variable. The six grey areas at the bottom of the lowest boxes (“terminal nodes”), together with the “p” in these boxes, indicate the proportion of participants in each partition with a score of 7 or higher on the CAST. The “n” in the lowest boxes indicates the number of participants in each terminal node.
Country specific classification tree models compared to the generic tree model.
| Dataset | n | Country Specific Model Accuracy (95% CI) | Generic Model Accuracy (95% CI) | χ2(1) | p | Sensitivity | Specificity | Positive Predictive Value | Negative Predictive Value |
| Czech | 386 | 0.73 (0.69–0.78) | 0.67 (0.62–0.72) | 2.99 | 0.08 | 0.70 | 0.76 | 0.65 | 0.79 |
| Italy | 800 | 0.72 (0.69–0.75) | 0.66 (0.62–0.69) | 6.45 | 0.01 | 0.61 | 0.80 | 0.68 | 0.75 |
| Netherlands | 839 | 0.80 (0.77–0.82) | 0.80 (0.77–0.83) | 0.00 | 1.00 | 0.81 | 0.77 | 0.71 | 0.87 |
| Sweden | 592 | 0.75 (0.71–0.79) | 0.73 (0.70–0.77) | 0.53 | 0.47 | 0.76 | 0.74 | 0.69 | 0.81 |
Note: Country Specific Model refers to the classification trees presented in Figure 2; Generic Model refers to the overarching model presented in Figure 1; Accuracy indicates the proportion of correctly classified cases, with associated confidence interval; Difference in accuracy between Country Specific Model and Generic Model was tested using a 2-sided Chi-Square test; Sensitivity = [number of correctly classified CAST ≥7]/[number of CAST ≥7 in sample]; Specificity = [number of correctly classified CAST <7]/[number of CAST <7 in sample]; Positive predictive value = [number of correctly classified CAST ≥7]/[all classified CAST ≥7]; Negative predictive value = [number of correctly classified CAST <7]/[all classified CAST <7]. The country datasets contain both participants from the training and the validation dataset.
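The relationship between the χ2(1) and p columns can be checked with a few lines of code. The sketch below assumes an uncorrected Pearson chi-square on the 2×2 table of correct versus incorrect classifications per model; the note does not say whether a continuity correction was applied, so the statistic function is an assumption, and both function names are hypothetical. The conversion from a 1-degree-of-freedom chi-square value to a two-sided p-value is exact.

```python
from math import erfc, sqrt


def accuracy_chi2(correct_a: int, n_a: int, correct_b: int, n_b: int) -> float:
    """Pearson chi-square (no continuity correction) comparing two accuracies.

    Rows are the two models; columns are correct / incorrect classifications.
    """
    wrong_a, wrong_b = n_a - correct_a, n_b - correct_b
    n = n_a + n_b
    denom = n_a * n_b * (correct_a + correct_b) * (wrong_a + wrong_b)
    if denom == 0:
        return 0.0
    return n * (correct_a * wrong_b - wrong_a * correct_b) ** 2 / denom


def chi2_df1_p_value(stat: float) -> float:
    """P(X > stat) for a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(stat / 2))


# The chi-square values reported in the table, converted to p-values.
reported = {"Czech": 2.99, "Italy": 6.45, "Netherlands": 0.00, "Sweden": 0.53}
p_values = {country: round(chi2_df1_p_value(x), 2) for country, x in reported.items()}
```

Rounded to two decimals, the p-values in `p_values` agree with the p column of the table.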