Sung Noh Hong1, Hee Jung Son1,2, Sun Kyu Choi3, Dong Kyung Chang1, Young-Ho Kim1, Sin-Ho Jung3,4, Poong-Lyul Rhee1.
Abstract
BACKGROUND: An electronic medical record (EMR) database of a large unselected population who received screening colonoscopies may minimize sampling error and represent real-world estimates of risk for screening target lesions of advanced colorectal neoplasia (CRN). Our aim was to develop and validate a prediction model for assessing the probability of advanced CRN using a clinical data warehouse.
Year: 2017 PMID: 28841657 PMCID: PMC5571924 DOI: 10.1371/journal.pone.0181040
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Process diagram of a Concept Extraction-based Text Analysis System.
Fig 2. Text pre-processing.
Fig 3. Concept extraction process.
Comparison of concept extraction results and manual data extraction.
| Category | Precision (%) | Recall (%) | Category | Precision (%) | Recall (%) |
|---|---|---|---|---|---|
| PAST MEDICAL HISTORY | 100.00 | 100.00 | LESION | 100.00 | 100.00 |
| ANTITHROMBOTICS | 100.00 | 100.00 | ABNORMALITY | 100.00 | 100.00 |
| FAMILY HISTORY OF CANCER | 100.00 | 100.00 | HISTOLOGICAL CLASSIFICATION ADJ | 100.00 | 100.00 |
| LAST COLONOSCOPY | 100.00 | 100.00 | HISTOLOGIC TYPE | 100.00 | 100.00 |
| INDICATION | 100.00 | 97.75 | TUMOR GRADING | 100.00 | 100.00 |
| SEDATION | 100.00 | 100.00 | SIZE | 92.05 | 98.78 |
| MIDAZOLAM | 100.00 | 100.00 | NUMBER | 100.00 | 100.00 |
| PETHIDINE | 100.00 | 100.00 | SHAPE | 100.00 | 100.00 |
| LEVEL OF SEDATION | 100.00 | 98.88 | COLOR | 100.00 | 100.00 |
| PARADOXICAL RESPONSE | 100.00 | 100.00 | VIDEO | 95.51 | 100.00 |
| ANTISPASMODICS | 100.00 | 100.00 | SLIDE | 95.51 | 100.00 |
| CIMETROPIUM | 100.00 | 100.00 | ORGAN | 100.00 | 100.00 |
| DIGITAL RECTAL EXAMINATION | 96.63 | 100.00 | BIOPSY | 100.00 | 100.00 |
| BOWEL PREPARATION | 100.00 | 100.00 | BIOPSY STATUS | 98.88 | 100.00 |
| CECAL INTUBATION TIME | 100.00 | 97.75 | BIOPSY METHOD | 93.26 | 100.00 |
| WITHDRAWAL TIME | 100.00 | 97.75 | SUBMUCOSAL INJECTION | 100.00 | 100.00 |
| INSERTED UP TO | 100.00 | 98.88 | HEMOSTASIS | 100.00 | 100.00 |
| ORGAN | 100.00 | 100.00 | DIAGNOSIS | 100.00 | 100.00 |
| SITE | 98.88 | 100.00 | IMPRESSION | 100.00 | 100.00 |
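
The precision and recall values above follow the usual definitions, with manual data extraction taken as the reference standard. The sketch below is only an illustration: the per-category counts are not reported in this table, so the counts passed to the hypothetical helper `precision_recall` are assumed values chosen for demonstration.

```python
def precision_recall(tp, fp, fn):
    """Return (precision %, recall %) from true positives, false positives, false negatives.

    Manual extraction is treated as the reference standard (assumption based on the
    table caption). Returns None for a metric whose denominator is zero.
    """
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else None
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Hypothetical counts (NOT from the paper): 87 concepts extracted correctly,
# none extracted wrongly, 2 missed by the system.
precision, recall = precision_recall(tp=87, fp=0, fn=2)
print(f"precision={precision:.2f}%, recall={recall:.2f}%")
```

A category with perfect precision but recall below 100% therefore means the system never extracted a wrong value but missed some values that the manual reviewers found.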
Fig 4. Flow diagram of the study population.
Clinical characteristics of enrolled subjects.
| Variable | Total, N | Total, value | Training set, N | Training set, value | Validation set, N | Validation set, value | P value |
|---|---|---|---|---|---|---|---|
| Demographics | |||||||
| Age (years), mean ± SD | 49,450 | 49.9 ± 9.3 | 24,726 | 49.9 ± 9.4 | 24,724 | 49.8 ± 9.3 | 0.160 |
| Sex | 49,450 | 24,726 | 24,724 | 0.008 | |||
| Female, n (%) | 21,762 (44.0) | 10,735 (43.4) | 11,027 (44.6) | ||||
| Male, n (%) | 27,688 (56.0) | 13,991 (56.6) | 13,697 (55.4) | ||||
| Family history of colorectal cancer | 45,583 | 22,759 | 22,824 | 0.694 | |||
| Yes, n (%) | 2,251 (4.9) | 1,133 (5.0) | 1,118 (4.9) | ||||
| No, n (%) | 43,332 (95.1) | 21,626 (95.0) | 21,706 (95.1) | ||||
| Physical measurement | |||||||
| Body mass index, mean ± SD | 44,581 | 23.7 ± 3.1 | 22,275 | 23.7 ± 3.1 | 22,306 | 23.6 ± 3.1 | 0.410 |
| Waist circumference (cm), mean ± SD | 44,145 | 83.4 ± 44.3 | 22,057 | 83.7 ± 62.0 | 22,088 | 83.2 ± 9.1 | 0.229 |
| Body fat percentage | 49,058 | 25.5 ± 6.6 | 24,517 | 25.4 ± 6.5 | 24,541 | 25.5 ± 6.6 | 0.813 |
| Cigarette smoking | |||||||
| Smoking status | 42,579 | 21,271 | 21,308 | 0.319 | |||
| Non-smoker, n (%) | 23,841 (56.0) | 11,838 (55.7) | 12,003 (56.4) | ||||
| Ex-smoker, n (%) | 5,260 (12.3) | 2,631 (12.3) | 2,629 (12.3) | ||||
| Current smoker, n (%) | 13,478 (31.7) | 6,802 (32.0) | 6,676 (31.3) | ||||
| Smoking duration (year), mean ± SD | 43,108 | 9.9 ± 12.8 | 21,526 | 10.0 ± 12.9 | 21,582 | 9.8 ± 12.7 | 0.143 |
| Smoking amount (pack/day), mean ± SD | 43,107 | 0.9 ± 1.1 | 21,534 | 0.9 ± 1.1 | 21,573 | 0.9 ± 1.1 | 0.238 |
| Alcohol drinking | |||||||
| Regular alcohol drinking | 43,777 | 21,868 | 21,909 | 0.719 | |||
| Yes, n (%) | 19,814 (45.3) | 9,879 (45.2) | 9,935 (45.4) | ||||
| No, n (%) | 23,963 (54.7) | 11,989 (54.8) | 11,974 (54.6) | ||||
| Drinking duration (year), mean ± SD | 26,395 | 23.8 ± 10.1 | 13,250 | 23.8 ± 10.2 | 13,145 | 23.8 ± 10.0 | 0.551 |
| Drinking frequency (/week), mean ± SD | 40,171 | 20,096 | 20,075 | ||||
| No drinking, n (%) | 19,814 (49.3) | 9,879 (49.2) | 9,935 (49.5) | 0.843 | |||
| Once a week, n (%) | 3,268 (8.1) | 1,671 (8.3) | 1,597 (8.0) | ||||
| 2–3 times per month, n (%) | 5,792 (14.4) | 2,897 (14.4) | 2,895 (14.4) | ||||
| 1–2 times per week, n (%) | 6,730 (16.8) | 3,344 (16.6) | 3,386 (16.9) | ||||
| 3–4 times per week, n (%) | 3,436 (8.6) | 1,737 (8.6) | 1,699 (8.5) | ||||
| 5–6 times per week, n (%) | 779 (1.9) | 411 (2.0) | 368 (1.8) | ||||
| Everyday, n (%) | 352 (0.9) | 157 (0.8) | 195 (1.0) | ||||
| Drinking amount at one time (bottle), mean ± SD | 40,027 | 1.2 ± 1.4 | 20,028 | 1.2 ± 1.4 | 19,999 | 1.2 ± 1.4 | 0.782 |
| Physical activity | |||||||
| Type of physical activities | 29,150 | 14,558 | 14,592 | 0.132 | |||
| Strenuous activities, n (%) | 2,148 (7.4) | 1,097 (7.5) | 1,051 (7.2) | ||||
| Moderate activities, n (%) | 6,719 (23.1) | 3,299 (22.7) | 3,420 (23.4) | ||||
| Mild activities, n (%) | 17,895 (61.4) | 8,930 (61.3) | 8,965 (61.4) | ||||
| None, n (%) | 2,388 (8.2) | 1,232 (8.5) | 1,156 (7.9) | ||||
| Physical activity frequency (/week), mean ± SD | 28,510 | 2.8 ± 0.9 | 14,267 | 2.79 ± 0.88 | 14,243 | 2.80 ± 0.87 | 0.121 |
| Physical activity duration (minutes), mean ± SD | 28,663 | 36.8 ± 11.6 | 14,337 | 36.74 ± 11.62 | 14,326 | 36.82 ± 11.49 | 0.562 |
| Co-morbidity | |||||||
| Hypertension, n (%) | 6,545 (13.2) | 3,325 (13.5) | 3,220 (13.0) | 0.166 | |||
| Diabetes, n (%) | 1,917 (3.9) | 932 (3.8) | 985 (4.0) | 0.216 | |||
| Hyperlipidemia, n (%) | 1,941 (3.9) | 932 (3.8) | 1,009 (4.1) | 0.355 | |||
| Aspirin use | |||||||
| Regular use, n (%) | 2,612 (5.3) | 1,336 (5.4) | 1,276 (5.2) | 0.229 | |||
| No use, n (%) | 46,838 (94.7) | 23,390 (94.6) | 23,448 (94.8) | ||||
| Laboratory measurement | |||||||
| Hemoglobin, mean ± SD | 49,136 | 14.3 ± 31.5 | 24,556 | 14.3 ± 1.6 | 24,580 | 14.3 ± 1.5 | 0.601 |
| Hematocrit, mean ± SD | 49,136 | 42.4 ± 34.2 | 24,556 | 42.4 ± 4.2 | 24,580 | 42.4 ± 4.2 | 0.463 |
| Platelet, mean ± SD | 49,136 | 234.9 ± 52.3 | 24,556 | 234.7 ± 51.9 | 24,580 | 235.1 ± 52.8 | 0.421 |
| Prothrombin time (INR) | 46,820 | 1.0 ± 0.1 | 23,404 | 1.0 ± 0.1 | 23,416 | 1.0 ± 0.1 | 0.537 |
| Total protein | 49,137 | 7.1 ± 0.4 | 24,559 | 7.1 ± 0.4 | 24,578 | 7.1 ± 0.4 | 0.051 |
| Albumin | 49,137 | 4.5 ± 0.3 | 24,559 | 4.5 ± 0.3 | 24,578 | 4.5 ± 0.3 | 0.419 |
| Total bilirubin, mean ± SD | 49,137 | 0.9 ± 0.4 | 24,559 | 0.9 ± 0.4 | 24,578 | 0.9 ± 0.4 | 0.507 |
| Aspartate transaminase | 49,142 | 26.1 ± 16.1 | 24,561 | 26.2 ± 16.6 | 24,581 | 26.0 ± 15.6 | 0.233 |
| Alanine transaminase | 49,142 | 26.4 ± 24.6 | 24,561 | 26.5 ± 24.6 | 24,581 | 26.3 ± 24.5 | 0.292 |
| Alkaline phosphatase | 49,137 | 63.3 ± 18.3 | 24,558 | 63.6 ± 18.7 | 24,579 | 62.9 ± 17.9 | 0.001 |
| γ-glutamyltransferase, mean ± SD | 48,603 | 33.9 ± 44.8 | 24,295 | 34.1 ± 47.0 | 24,308 | 33.7 ± 42.6 | 0.268 |
| Uric acid, mean ± SD | 49,129 | 5.2 ± 1.4 | 24,554 | 5.2 ± 1.4 | 24,575 | 5.2 ± 1.4 | 0.001 |
| Blood urea nitrogen | 49,135 | 13.3 ± 3.4 | 24,559 | 13.3 ± 3.4 | 24,576 | 13.3 ± 3.4 | 0.499 |
| Creatinine | 49,138 | 0.9 ± 0.2 | 24,560 | 0.9 ± 0.2 | 24,578 | 0.9 ± 0.2 | 0.182 |
| Fasting glucose, mean ± SD | 49,146 | 93.7 ± 18.0 | 24,563 | 93.7 ± 17.7 | 24,583 | 93.8 ± 18.3 | 0.625 |
| Hemoglobin A1c, mean ± SD | 47,575 | 5.6 ± 0.7 | 23,790 | 5.6 ± 0.7 | 23,785 | 5.6 ± 0.7 | 0.916 |
| Insulin | 34,601 | 7.4 ± 4.4 | 17,337 | 7.4 ± 4.5 | 17,264 | 7.3 ± 4.3 | 0.206 |
| C-peptide | 34,602 | 1.7 ± 0.8 | 17,337 | 1.7 ± 0.8 | 17,265 | 1.7 ± 0.8 | 0.174 |
| Total cholesterol, mean ± SD | 49,153 | 196.5 ± 34.7 | 24,569 | 196.7 ± 34.8 | 24,584 | 196.3 ± 34.5 | 0.194 |
| Triglyceride, mean ± SD | 48,757 | 119.0 ± 76.0 | 24,377 | 119.2 ± 75.6 | 24,380 | 118.7 ± 76.5 | 0.501 |
| HDL-cholesterol, mean ± SD | 48,755 | 55.9 ± 14.6 | 24,376 | 55.4 ± 14.7 | 24,379 | 55.6 ± 14.6 | 0.120 |
| LDL-cholesterol, mean ± SD | 48,759 | 123.9 ± 31.1 | 24,378 | 124.2 ± 31.2 | 24,381 | 123.7 ± 31.0 | 0.117 |
| C-reactive protein, mean ± SD | 43,613 | 0.1 ± 0.3 | 21,800 | 0.1 ± 0.3 | 21,813 | 0.1 ± 0.3 | 0.073 |
| Calcium, mean ± SD | 49,128 | 9.2 ± 0.4 | 24,554 | 9.2 ± 0.4 | 24,574 | 9.2 ± 0.4 | 0.755 |
| Ferritin, mean ± SD | 40,343 | 121.1 ± 120.2 | 20,211 | 122.4 ± 126.5 | 20,132 | 119.7 ± 113.6 | 0.089 |
| Colonoscopic and pathologic finding of enrolled patients | |||||||
| No adenoma | 34,734 | 70.2% | 17,377 | 70.3% | 17,357 | 70.2% | 0.855 |
| Serrated polyp | 5,868 | 11.9% | 2,916 | 11.8% | 2,952 | 11.9% | 0.614 |
| Any adenoma | 14,716 | 29.8% | 7,349 | 29.7% | 7,367 | 29.8% | 0.855 |
| Number of adenomas | |||||||
| 1 or 2 | 12,251 | 24.8% | 6,084 | 24.6% | 6,167 | 24.9% | 0.319 |
| ≥3 | 2,465 | 5.0% | 1,265 | 5.1% | 1,200 | 4.9% | |
| Size of the largest adenoma | |||||||
| ≤10 mm | 14,002 | 28.3% | 6,994 | 28.3% | 7,008 | 28.3% | 0.976 |
| >10 mm | 714 | 1.4% | 355 | 1.4% | 359 | 1.5% | |
| Histology of adenoma | |||||||
| Tubular adenoma | 14,586 | 29.5% | 7,279 | 29.4% | 7,307 | 29.6% | 0.659 |
| Tubulovillous or villous adenoma | 130 | 0.3% | 70 | 0.3% | 60 | 0.2% | |
| Dysplasia grade | |||||||
| Low-grade dysplasia | 14,511 | 29.3% | 7,251 | 29.3% | 7,260 | 29.4% | 0.928 |
| High-grade dysplasia | 125 | 0.3% | 59 | 0.2% | 66 | 0.3% | |
| Non-advanced adenoma | 13,691 | 27.7% | 6,836 | 27.7% | 6,855 | 27.7% | 0.981 |
| Advanced adenoma | 989 | 2.0% | 495 | 2.0% | 494 | 2.0% | 0.975 |
| Invasive cancer | 92 | 0.2% | 45 | 0.2% | 47 | 0.2% | 0.834 |
| Advanced neoplasia | 1,025 | 2.1% | 513 | 2.1% | 512 | 2.1% | 0.976 |
*Measured by a bioelectrical impedance device.
†Type of physical activity performed during the last 7 days, including recreation, exercise, sports, and activities at work:
- strenuous activities: e.g., labor, aerobics, fast cycling, jogging, soccer
- moderate activities: e.g., brisk walking, swimming, mountain climbing, doubles tennis
- mild activities: e.g., walking, golf, household chores
- none: "I do not even walk for 10 m"
§Advanced adenoma was defined as an adenoma with villous histology, high-grade dysplasia, or size >10 mm.
¶Advanced neoplasia was defined as advanced adenoma or invasive cancer.
Stepwise logistic regression for predicting patients with advanced colorectal neoplasia among individuals who underwent their first colonoscopy.
| 1. Predictor selection for advanced neoplasia using the imputed training set | |||
| Parameter | Estimate | Standard Error | P value |
| Intercept | -8.282 | 0.394 | < .001 |
| Uric acid | 0.062 | 0.039 | .110 |
| γ-Glutamyltransferase | 0.001 | 0.001 | .035 |
| Smoking duration | 0.015 | 0.004 | < .001 |
| Drinking duration | 0.010 | 0.007 | .131 |
| Drinking frequency | 0.082 | 0.030 | .007 |
| Aspirin use | -0.299 | 0.096 | .002 |
| Diabetes | 0.225 | 0.110 | .041 |
| Gender | 0.122 | 0.069 | .075 |
| Age | 0.065 | 0.006 | < .001 |
| 2. Predictor refining for advanced neoplasia using the complete training set: variables with a | |||
| Parameter | Estimate | Standard Error | P value |
| Intercept | -8.720 | 0.515 | < .001 |
| Uric acid | 0.073 | 0.049 | .140 |
| γ-Glutamyltransferase | 0.001 | 0.001 | .026 |
| Smoking duration | 0.015 | 0.005 | .002 |
| Drinking duration | 0.005 | 0.009 | .538 |
| Drinking frequency | 0.089 | 0.035 | .011 |
| Aspirin use | -0.192 | 0.111 | .082 |
| Gender | 0.138 | 0.123 | .261 |
| Age | 0.071 | 0.009 | < .001 |
| Parameter | Estimate | Standard Error | P value |
| Intercept | -8.710 | 0.432 | < .001 |
| Uric acid | 0.050 | 0.044 | .253 |
| γ-Glutamyltransferase | 0.001 | 0.001 | .024 |
| Smoking duration | 0.015 | 0.004 | .001 |
| Drinking frequency | 0.095 | 0.031 | .002 |
| Aspirin use | -0.288 | 0.104 | .006 |
| Gender | 0.153 | 0.083 | .064 |
| Age | 0.074 | 0.006 | < .001 |
| 3. Prediction model for advanced neoplasia (Model 1) | |||
| Parameter | Estimate | Standard Error | P value |
| Intercept | -8.428 | 0.354 | < .001 |
| γ-Glutamyltransferase | 0.002 | 0.001 | .016 |
| Smoking duration | 0.015 | 0.004 | .001 |
| Drinking frequency | 0.094 | 0.031 | .002 |
| Aspirin use | -0.286 | 0.104 | .006 |
| Gender | 0.190 | 0.076 | .012 |
| Age | 0.074 | 0.006 | < .001 |
| Parameter | Estimate | Standard Error | P value |
| Intercept | -8.390 | 0.350 | < .001 |
| Smoking duration | 0.015 | 0.004 | < .001 |
| Drinking frequency | 0.100 | 0.031 | .001 |
| Aspirin use | -0.289 | 0.104 | .006 |
| Gender | 0.205 | 0.075 | .007 |
| Age | 0.074 | 0.006 | < .001 |
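
A logistic regression model of this kind turns a weighted sum of predictors into a probability through the logistic function. The sketch below applies the Model 1 estimates listed above. The variable coding is an assumption, since this table does not state it: age and smoking duration in years, γ-glutamyltransferase as the raw laboratory value, drinking frequency as an ordinal code, and aspirin use and gender as 0/1 indicators (with 1 assumed to mean regular use and male, respectively). The function names and the example subject are hypothetical.

```python
import math

# Estimates copied from the "Prediction model for advanced neoplasia (Model 1)" rows.
MODEL_1 = {
    "intercept": -8.428,
    "ggt": 0.002,                 # gamma-glutamyltransferase
    "smoking_duration": 0.015,
    "drinking_frequency": 0.094,
    "aspirin_use": -0.286,
    "gender": 0.190,
    "age": 0.074,
}


def linear_predictor(subject, coef=MODEL_1):
    """Intercept plus the sum of coefficient * predictor value (the log-odds)."""
    return coef["intercept"] + sum(coef[name] * value for name, value in subject.items())


def predicted_risk(subject, coef=MODEL_1):
    """Logistic transform of the linear predictor: a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(subject, coef)))


# Hypothetical subject; the predictor coding is an assumption, not taken from the paper.
subject = {
    "ggt": 30.0,
    "smoking_duration": 0.0,
    "drinking_frequency": 0.0,
    "aspirin_use": 0,
    "gender": 1,
    "age": 50,
}
print(f"log-odds={linear_predictor(subject):.3f}, risk={predicted_risk(subject):.4f}")
```

Under this assumed coding, the negative aspirin estimate lowers the predicted risk, while age, smoking duration, and drinking frequency raise it.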
Fig 5. Model performance.
The area under the receiver operating characteristic curve (AUC) was calculated to evaluate discrimination in the training set (line) and validation set (dot) for prediction model 1 (A) and model 2 (B).
Fig 6. Model calibration.
Cut-off values to discriminate between the high- and low-risk groups for advanced colorectal neoplasia were set at the point between the sixth and seventh deciles based on the risk of advanced colorectal neoplasia.
Model calibration and estimation of the cut-off value for discriminating between the high- and low-risk groups for advanced colorectal neoplasia (CRN).
| Decile of predicted risk | Training set, N | Training set, prevalence of advanced CRN (%) | Validation set, N | Validation set, prevalence of advanced CRN (%) | Prevalence by risk group (%) | Risk group |
|---|---|---|---|---|---|---|
| 1 | 1922 | 0.260 | 1901 | 0.316 | 1.067 | Low-risk group |
| 2 | 1922 | 0.520 | 2064 | 0.581 | ||
| 3 | 1922 | 0.937 | 1983 | 1.261 | ||
| 4 | 1922 | 1.145 | 2079 | 1.058 | ||
| 5 | 1922 | 1.197 | 1715 | 1.808 | ||
| 6 | 1922 | 1.509 | 1871 | 1.497 | ||
| 7 | 1922 | 2.445 | 1960 | 2.449 | 3.955 | High-risk group |
| 8 | 1922 | 2.653 | 1867 | 3.267 | ||
| 9 | 1922 | 4.214 | 1873 | 3.951 | ||
| 10 | 1929 | 5.962 | 1885 | 6.207 | ||
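
The two group-level prevalence values in the table (1.067 and 3.955) can be approximately reproduced by pooling the validation-set deciles on either side of the cut-off between the sixth and seventh deciles. The sketch below does this arithmetic from the rows above. Rounding each decile to a whole number of cases is an assumption made here, because the table reports percentages rather than counts, and `pooled_prevalence` is a hypothetical helper.

```python
# (N, prevalence of advanced CRN in %) for validation-set deciles 1..10, copied from the table.
VALIDATION_DECILES = [
    (1901, 0.316), (2064, 0.581), (1983, 1.261), (2079, 1.058), (1715, 1.808),
    (1871, 1.497), (1960, 2.449), (1867, 3.267), (1873, 3.951), (1885, 6.207),
]

CUT_OFF_AFTER_DECILE = 6  # cut-off lies between the sixth and seventh deciles


def pooled_prevalence(deciles):
    """Pooled prevalence (%) over several deciles.

    Case counts are reconstructed as round(N * pct / 100), which is an assumption:
    the table gives percentages only.
    """
    cases = sum(round(n * pct / 100) for n, pct in deciles)
    total = sum(n for n, _ in deciles)
    return 100.0 * cases / total


low_risk = pooled_prevalence(VALIDATION_DECILES[:CUT_OFF_AFTER_DECILE])
high_risk = pooled_prevalence(VALIDATION_DECILES[CUT_OFF_AFTER_DECILE:])
print(f"low-risk: {low_risk:.3f}%  high-risk: {high_risk:.3f}%")
```

The observed prevalence rises roughly monotonically across deciles in both sets, which is the pattern a reasonably calibrated model is expected to show.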
Discrimination ability of the low-risk group from the high-risk group for advanced colorectal neoplasia.
| Group | Advanced CRN (-), n | Advanced CRN (+), n | P value | Sensitivity, % | Specificity, % | Accuracy, % | PPV, % | NPV, % |
|---|---|---|---|---|---|---|---|---|
| Training set | ||||||||
| Low-risk group, n | 11491 | 107 | <.001 | 73.3 | 61.0 | 61.3 | 3.9 | 99.1 |
| High-risk group, n | 7335 | 294 | ||||||
| Validation set | ||||||||
| Low-risk group, n | 11487 | 124 | <.001 | 70.8 | 61.2 | 61.4 | 4.0 | 98.9 |
| High-risk group, n | 7288 | 300 | ||||||
| Total dataset | ||||||||
| Low-risk group, n | 22978 | 231 | <.001 | 72.0 | 61.1 | 61.3 | 3.9 | 99.0 |
| High-risk group, n | 14623 | 594 | ||||||
CRN, colorectal neoplasia; PPV, positive predictive value; NPV, negative predictive value.
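
The discrimination measures in the table follow directly from the four cell counts of each set, treating membership in the high-risk group as a positive test. The sketch below recomputes them for the training set. The function name `screening_metrics` is a hypothetical helper, and mapping the low-risk group to test-negative is the interpretation assumed here.

```python
def screening_metrics(tn, fn, fp, tp):
    """Return sensitivity, specificity, accuracy, PPV and NPV as percentages.

    tn: low-risk, advanced CRN (-)    fn: low-risk, advanced CRN (+)
    fp: high-risk, advanced CRN (-)   tp: high-risk, advanced CRN (+)
    """
    total = tn + fn + fp + tp
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "accuracy": 100.0 * (tp + tn) / total,
        "ppv": 100.0 * tp / (tp + fp),
        "npv": 100.0 * tn / (tn + fn),
    }


# Training-set counts copied from the table.
training = screening_metrics(tn=11491, fn=107, fp=7335, tp=294)
for name, value in training.items():
    print(f"{name}: {value:.1f}%")
```

The low PPV alongside a high NPV reflects the low prevalence of advanced CRN (about 2%) in this screening population: most high-risk subjects still have no advanced lesion, while a low-risk classification is rarely wrong.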
Fig 7. Comparison of the discrimination performance of the final model with previously published prediction models for advanced colorectal neoplasia.