
A prediction model for advanced colorectal neoplasia in an asymptomatic screening population.

Sung Noh Hong1, Hee Jung Son1,2, Sun Kyu Choi3, Dong Kyung Chang1, Young-Ho Kim1, Sin-Ho Jung3,4, Poong-Lyul Rhee1.   

Abstract

BACKGROUND: An electronic medical record (EMR) database of a large unselected population who received screening colonoscopies may minimize sampling error and represent real-world estimates of risk for screening target lesions of advanced colorectal neoplasia (CRN). Our aim was to develop and validate a prediction model for assessing the probability of advanced CRN using a clinical data warehouse.
METHODS: A total of 49,450 screenees underwent their first colonoscopy as part of a health check-up from 2002 to 2012 at Samsung Medical Center, and the dataset was constructed by means of natural language processing from the computerized EMR system. The screenees were randomly divided into training and validation sets. The prediction model was developed using logistic regression. The model performance was validated and compared with existing models using area under the receiver operating characteristic curve (AUC) analysis.
RESULTS: In the training set, age, gender, smoking duration, drinking frequency, and aspirin use were identified as independent predictors for advanced CRN (adjusted P < .01). The developed model had good discrimination (AUC = 0.726) and was internally validated (AUC = 0.713). The high-risk group had a 3.7-fold increased risk of advanced CRN compared to the low-risk group (4.0% vs. 1.1%, P < .001). The discrimination performance of the present model for high-risk patients with advanced CRN was better than that of the Asia-Pacific Colorectal Screening score (AUC = 0.678, P < .001) and Schroy's CAN index (AUC = 0.672, P < .001).
CONCLUSION: The present 5-item risk model can be calculated readily using a simple questionnaire and can identify the low- and high-risk groups of advanced CRN at the first screening colonoscopy. This model may increase colorectal cancer risk awareness and assist healthcare providers in encouraging the high-risk group to undergo a colonoscopy.

Year:  2017        PMID: 28841657      PMCID: PMC5571924          DOI: 10.1371/journal.pone.0181040

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Colorectal cancer (CRC) is the third most common cancer in the world [1]. Colonoscopy is considered the preferred CRC screening modality [2]; however, adherence is generally not sufficient [3]. One of the barriers to CRC screening is a lack of perceived risk among patients and primary care providers [4]. Risk stratification provides a rational strategy for facilitating appropriate CRC screening and can improve the distribution of resources. A prerequisite for this risk stratification approach is the availability of a precise risk assessment tool. Although several risk prediction models for the screening target lesions, CRC or advanced colorectal neoplasia (CRN), have been developed [4-14], previous models have some limitations, such as the omission of possible risk factors [4, 5, 7, 9–11, 13]. In the current healthcare system, electronic medical records (EMRs) encompass a plethora of patient-related data, such as demographics, vital signs, medical history, medication, and results from laboratory and imaging studies. Medical data mining and correlational studies using EMRs could serve as a valuable resource for identifying unrevealed risk factors under deductive assumptions and for establishing a real-world prediction model for advanced CRN. However, EMR data contain unstructured data, such as endoscopic and pathology reports, which require laborious effort to transform from text to numerical data [15]. Recent advanced natural language processing algorithms, such as the Concept Extraction-based Text Analysis System (CETAS), are able to transform information from endoscopic and pathology reports into a numerical dataset. In this study, we used the CETAS to construct a database from the EMR data of 49,450 patients who underwent their first screening colonoscopy as part of routine health check-up examinations, and we developed a risk prediction model to identify individuals at high risk of advanced CRN.

Methods

Study population

The Center for Health Promotion of Samsung Medical Center, Seoul, Republic of Korea, provides regular health screening examinations that include a colonoscopy [16]. We included consecutive subjects who underwent a screening colonoscopy during health screening examinations at the Center for Health Promotion between January 2003 and December 2012. Regular routine health screening is very common in Korea due to the Industrial Safety and Health Law [16, 17]. The health screening examinations were performed as described previously [16]. All participants completed a questionnaire and received a detailed physical examination as part of the screening program. Self-administered questionnaire data were used to identify current smoking status, alcohol drinking frequency, physical activity, family history of colon cancer, history of colorectal polyps/cancer, comorbidities, and regular use of aspirin. Participants were asked to fast for at least 12 hours and to avoid smoking on the morning of the examination. Blood samples were collected on the day of the colonoscopy. Serum biochemical tests were carried out using an automatic analyzer at the Department of Laboratory Medicine at Samsung Medical Center.

Screening colonoscopies

All colonoscopies were performed by board-certified endoscopists. During colonoscopy, the location, size, number, and appearance of CRN were recorded. The location was assessed by the endoscopists, and the size was estimated using open biopsy forceps. The gross appearance of each lesion was classified using Paris endoscopic classification [18]. All of the colorectal lesions were histologically evaluated and classified according to the World Health Organization classification [19]. However, because the colonoscopy and pathology reports were described by the performing endoscopists and pathologists, the natural language for describing the lesions was different in each report despite using standardized terms. For example, even though the information was the same, the endoscopists used different units, such as cm or mm, and various modifiers, such as elevated, raised, upraised, protruded, and bulged. Therefore, the reports were considered to contain unstructured data, and it was difficult to extract unified forms of variables in real practice.

Data collection

This study used only de-identified medical records that were collected for administrative or clinical purposes as part of routine health screening examinations in the Center for Health Promotion of Samsung Medical Center. The Center for Health Promotion provides researchers de-identified information for biomedical research, which was approved by the Institutional Review Board of Samsung Medical Center for studies that investigate decision-making and the relationships and potential patterns between disease progression and management. The EMRs included both structured and unstructured data. Structured data refer to information that was organized in a row-column database including demographics, physical measurement, smoking, alcohol drinking, physical activity, co-morbidities, aspirin use, and laboratory biochemical measurements. Unstructured data refers to information that does not reside in a traditional row-column database including the free text from colonoscopy and pathology reports. This study was approved by the Institutional Review Board of Samsung Medical Center, which waived the requirement for informed consent because the researchers only obtained de-identified routinely collected data from the institution's clinical data warehouse.

Unstructured text data analysis: Concept Extraction-based Text Analysis System (CETAS)

Among the data collected in this study, the number and size of CRN were obtained from the free text of the colonoscopy reports, and the histology and dysplasia grade of CRN were obtained from the free text of the pathology reports. Unstructured data were transformed from text to numerical data by the CETAS. The CETAS is based on SAS Enterprise Content Categorization (ECC) 12.2 (SAS Institute; Cary, NC, USA) and does not use add-on modules, such as text mining. SAS ECC 12.2 is an NLP solution that is separate from SAS Base; because it has a built-in LITI (Language Interface Taxonomy Interface) for performing Concept Extraction on text in a simple and effective manner, it enables rule-based construction of term matching and extraction (Fig 1) [15, 20].
Fig 1

Process diagram of a Concept Extraction-based Text Analysis System.

1. Concept dictionary

To create the Concept Dictionary that underlies the Concept Extraction Rule, about 500 colonoscopy and pathology reports were randomly sampled, and the terms that describe colorectal polyps (number, location, size, histology, and dysplasia grade) were organized and cleansed using a natural language processing methodology. In this process, the Concept Dictionary was configured with reference to SNOMED 3.x in order to map the non-standard terms in the EMR database to standard terminology.

2. Preprocess

A preprocessing step was constructed to convert the colonoscopy reports, which each endoscopist wrote with a different sentence structure, into a coherent sentence structure (Fig 2). Preprocessing comprises two tasks. Task 1 deletes or replaces non-standardized special symbols, such as bullet marks, commas, line breaks, and spacing; this provides a base on which Concept Extraction can rely on special symbols in the natural language processing steps. Task 2 divides the text into sentences or syntactic units according to periods, line breaks, conjunctions, prepositions, and similar markers. When the text about organs and lesions spans several sentences and paragraphs, this reduces the error of combining information from unrelated sentences (for example, sentence 1 and sentence 2) into a single result.
Fig 2

Text pre-processing.
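
The two preprocessing tasks can be illustrated with a minimal Python sketch. This is not the CETAS implementation, which is built on SAS ECC; the symbol set, the splitting pattern, and the function names `normalize_symbols` and `split_units` are assumptions made only for illustration, and splitting on conjunctions and prepositions is omitted.

```python
import re
from typing import List

# Hypothetical symbol set; the actual CETAS rules are not given in the text.
BULLETS = re.compile(r"[•·▪■◦*]+")
# Split on a period followed by whitespace or end of text (so "0.5 cm" survives),
# and on semicolons or line breaks.
UNIT_BREAK = re.compile(r"\.(?:\s+|$)|[;\n]+")


def normalize_symbols(text: str) -> str:
    """Task 1: replace non-standard symbols and make spacing uniform."""
    text = BULLETS.sub(" ", text)
    text = re.sub(r"\s*,\s*", ", ", text)  # uniform comma spacing
    text = re.sub(r"[ \t]+", " ", text)    # collapse runs of blanks
    return text.strip()


def split_units(text: str) -> List[str]:
    """Task 2: divide the text into sentence-like units."""
    return [unit.strip() for unit in UNIT_BREAK.split(text) if unit.strip()]


report = "• Cecum: polyp , 0.5 cm.\n• Rectum: no lesion"
print(split_units(normalize_symbols(report)))
```

Keeping the two tasks as separate functions mirrors the description above: symbol normalization runs first so that the splitting step can depend on consistent delimiters.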

3. Concept extraction

In the natural language processing step, categories were configured according to the hierarchical structure of the colorectal structures and lesions, and a Concept Extraction Rule was developed for each category. The effectiveness of rule-based natural language processing has been demonstrated in previous studies [21, 22]. Concept Extraction is performed by a Rule that extracts the terms stored in the Concept Dictionary when they occur within particular words of a sentence or within the Keyword Count delimited by special symbols. As a result, the researchers did not have to change the Rule when adding or removing terms; changing the Concept Dictionary automatically changed the results of the Concept Extraction (Fig 3).
Fig 3

Concept extraction process.
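
A minimal Python sketch of dictionary-driven extraction follows. It is an illustration under stated assumptions, not the SAS LITI rules used by the CETAS: the dictionary entries, the `SHAPE:elevated` labels, and the function names are hypothetical. It shows how synonymous modifiers (elevated, raised, upraised, protruded, bulged) can map to one concept and how sizes reported in cm or mm can be unified to millimetres.

```python
import re

# Illustrative concept dictionary (hypothetical entries): surface term -> standard concept.
CONCEPT_DICTIONARY = {
    "elevated": "SHAPE:elevated",
    "raised": "SHAPE:elevated",
    "upraised": "SHAPE:elevated",
    "protruded": "SHAPE:elevated",
    "bulged": "SHAPE:elevated",
    "tubular adenoma": "HISTOLOGIC TYPE:tubular adenoma",
}

# A number followed by a unit, e.g. "0.8 cm" or "12mm".
SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*(cm|mm)\b", re.IGNORECASE)


def extract_concepts(sentence, dictionary=CONCEPT_DICTIONARY):
    """Return the standard concepts whose surface terms occur in the sentence."""
    lowered = sentence.lower()
    found = {
        concept
        for term, concept in dictionary.items()
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    }
    return sorted(found)


def extract_sizes_mm(sentence):
    """Return every reported size converted to millimetres."""
    sizes = []
    for value, unit in SIZE.findall(sentence):
        factor = 10.0 if unit.lower() == "cm" else 1.0
        sizes.append(float(value) * factor)
    return sizes


sentence = "A raised polyp, 0.8 cm, in the sigmoid colon"
print(extract_concepts(sentence), extract_sizes_mm(sentence))
```

Because the matching logic only reads the dictionary, adding or removing a term changes the extraction results without touching the rule itself, which is the property described above.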

4. Validation

The first Concept Extraction Rule was created from a random sample of 500 colonoscopy and pathology reports from the clinical data warehouse of Samsung Medical Center, and the precision of the Rule was improved by expanding the sample to 2,000 reports. After the Concept Extraction Rule was created, its accuracy was verified by comparing the Concept Extraction results with manually generated results, using two indicators. Precision is the proportion of extracted items that are correct, that is, the ability to extract only the correct data; Recall is the proportion of the correct data present in the colonoscopy and pathology reports that was extracted. To verify the accuracy of the Concept Extraction Rule built in the CETAS, Precision and Recall were finally calculated on a random sample of 50 specimens from a 3-month period; they were 99.27% and 99.83%, respectively (Table 1). After validation, the unstructured data from the remaining 48,950 colonoscopy and pathology reports were transformed from text to numerical data.
Table 1

Comparison of concept extraction results and manual data extraction.

Category | Precision (%) | Recall (%) | Category | Precision (%) | Recall (%)
PAST MEDICAL HISTORY | 100.00 | 100.00 | LESION | 100.00 | 100.00
ANTITHROMBOTICS | 100.00 | 100.00 | ABNORMALITY | 100.00 | 100.00
FAMILY HISTORY OF CANCER | 100.00 | 100.00 | HISTOLOGICAL CLASSIFICATION ADJ | 100.00 | 100.00
LAST COLONOSCOPY | 100.00 | 100.00 | HISTOLOGIC TYPE | 100.00 | 100.00
INDICATION | 100.00 | 97.75 | TUMOR GRADING | 100.00 | 100.00
SEDATION | 100.00 | 100.00 | SIZE | 92.05 | 98.78
MIDAZOLAM | 100.00 | 100.00 | NUMBER | 100.00 | 100.00
PETHIDINE | 100.00 | 100.00 | SHAPE | 100.00 | 100.00
LEVEL OF SEDATION | 100.00 | 98.88 | COLOR | 100.00 | 100.00
PARADOXICAL RESPONSE | 100.00 | 100.00 | VIDEO | 95.51 | 100.00
ANTISPASMODICS | 100.00 | 100.00 | SLIDE | 95.51 | 100.00
CIMETROPIUM | 100.00 | 100.00 | ORGAN | 100.00 | 100.00
DIGITAL RECTAL EXAMINATION | 96.63 | 100.00 | BIOPSY | 100.00 | 100.00
BOWEL PREPARATION | 100.00 | 100.00 | BIOPSY STATUS | 98.88 | 100.00
CECAL INTUBATION TIME | 100.00 | 97.75 | BIOPSY METHOD | 93.26 | 100.00
WITHDRAWAL TIME | 100.00 | 97.75 | SUBMUCOSAL INJECTION | 100.00 | 100.00
INSERTED UP TO | 100.00 | 98.88 | HEMOSTASIS | 100.00 | 100.00
ORGAN | 100.00 | 100.00 | DIAGNOSIS | 100.00 | 100.00
SITE | 98.88 | 100.00 | IMPRESSION | 100.00 | 100.00
Total | 99.27 | 99.83
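
The two validation indicators can be computed as in the following Python sketch, which compares a set of extracted items with a manually generated reference set. The item tuples are invented toy data, not values from the study, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(extracted, reference):
    """Compare extracted items with a manually generated reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positive = len(extracted & reference)
    # Precision: share of extracted items that are correct.
    precision = true_positive / len(extracted) if extracted else 0.0
    # Recall: share of reference items that were extracted.
    recall = true_positive / len(reference) if reference else 0.0
    return precision, recall


# Toy example: (report id, category, value) triples.
extracted = {("r1", "SIZE", "8"), ("r1", "SHAPE", "elevated"), ("r2", "SIZE", "5")}
manual = {("r1", "SIZE", "8"), ("r1", "SHAPE", "elevated"),
          ("r2", "SIZE", "15"), ("r2", "NUMBER", "1")}
print(precision_recall(extracted, manual))
```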

Study design

We performed a cross-sectional analysis of patients ≥ 20 years of age who underwent their first screening colonoscopy. The exclusion criteria were as follows: 1) incomplete colonoscopy, 2) poor (semisolid stool that could not be suctioned or washed away and less than 90% of surface seen) and inadequate (repeat preparation and colonoscopy needed) bowel preparation, 3) incomplete colonoscopy report about the number and size related to CRN, 4) incomplete pathology report about the histology and dysplasia grade related to CRN, 5) history of previous colonoscopy, 6) history of colorectal polyps, cancer, or surgery, and 7) inflammatory bowel disease.

Definition of outcome measurement

An advanced CRN was defined as a cancer or an adenoma that was at least 10 mm in diameter, had high-grade dysplasia, had villous or tubulovillous histological characteristics, or showed any combination thereof [23]. For patients with multiple neoplasms, the size and appearance of the neoplasm with advanced pathology or of the largest polyp were reported. The main outcome measurement in this study was an advanced CRN detected by means of a colonoscopy and evaluated pathologically.
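
The outcome definition can be expressed as a short Python sketch. It reads the criteria as alternatives (size, dysplasia grade, or histology), which is consistent with "any combination thereof" and with the footnote to Table 2; the `Lesion` fields and the histology labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Lesion:
    size_mm: float
    histology: str  # hypothetical labels: "tubular", "tubulovillous", "villous", "cancer"
    high_grade_dysplasia: bool


def is_advanced(lesion: Lesion) -> bool:
    """Cancer, or an adenoma meeting at least one advanced criterion."""
    return (
        lesion.histology == "cancer"
        or lesion.size_mm >= 10
        or lesion.high_grade_dysplasia
        or lesion.histology in {"villous", "tubulovillous"}
    )


def has_advanced_crn(lesions: Iterable[Lesion]) -> bool:
    """A patient is positive if any of their lesions is advanced."""
    return any(is_advanced(lesion) for lesion in lesions)


print(has_advanced_crn([Lesion(4, "tubular", False), Lesion(12, "tubular", False)]))
```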

Prediction model

Structured data and unstructured data transformed from text to numerical data using the CETAS were used as the input variables of the prediction model. The enrolled subjects were randomly partitioned into a training set and a validation set using a 50–50 allocation. Candidate predictors with P < .10 in univariate analyses were included in the multivariable logistic regression. Backward selection was used to remove variables with non-significant (P ≥ .05) contributions to the multivariable model fit. Two prediction models were fitted. The first used both inquiry and laboratory variables, and the second used only inquiry variables.

Model performance and calibration

A two-sided alpha of 5% was used as the insertion and deletion criterion of the two-stage variable selection in fitting a prediction model (i.e., training). The prediction score from the fitted prediction model was applied to the validation set, and the performance of the prediction model was evaluated using area under the receiver operating characteristic curve (AUC) analysis. An AUC near 1 suggests excellent predictive ability, and an AUC near 0.5 indicates hardly any predictive ability. Calibration is a measure of how accurately the predicted probabilities of advanced CRN inferred from the training set match the subsequently observed event rate in the validation set. The negative predictive value (NPV) is the probability that a patient who is termed “no disease” by the risk score really has no disease. We want this probability to be very high (at least 99%) so as not to miss any significant disease. A cutoff value for the trained risk score was identified and shown to have an NPV of about 99% when applied to the validation set and the combined data set.
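
The interpretation of the AUC can be made concrete with a small Python sketch that computes it as the probability that a randomly chosen subject with advanced CRN receives a higher score than a randomly chosen subject without it (ties counted as one half). This pairwise form is chosen for clarity and is quadratic in the sample size; it is not necessarily the routine used in the study, and the scores below are toy values.

```python
def auc(scores, labels):
    """AUC as the probability that a case outranks a non-case (ties count 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    non_cases = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in non_cases)
    return wins / (len(cases) * len(non_cases))


# Toy scores: two non-cases followed by two cases.
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 3 of 4 pairs are ordered correctly
```

A score that carries no information gives about half of the pairs the correct order, which is why an AUC near 0.5 indicates hardly any predictive ability.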

Results

A total of 70,959 consecutive subjects underwent screening colonoscopy during health screening examinations at the Center for Health Promotion. We excluded 21,509 subjects who had incomplete or unsuitable reports for text analysis; poor bowel preparation; incomplete colonoscopy; or history of previous colonoscopy, colorectal polyps, cancer, or surgery, or inflammatory bowel disease. For subjects who underwent multiple colonoscopies, we selected the first colonoscopy for the present analysis. Finally, this study used only de-identified data from 49,450 participants who underwent their first screening colonoscopy and a health check-up. A flow diagram of the study population is shown in Fig 4. Of the eligible 49,450 patients who underwent their first screening colonoscopy, 27,688 were male (55.99%) and 21,762 were female (44.01%), all were Korean, and the mean age was 49.86 ± 9.33 years. One or more colorectal adenomas were found in 14,716 (29.8%) patients, 1,025 (2.1%) of whom had advanced adenoma, and 92 of whom had invasive cancer (0.2%). The overall prevalence of advanced CRN was 2.3%. The clinical characteristics of the enrolled participants are listed in Table 2. Enrolled participants were randomly divided into training and validation sets using a 50–50 allocation.
Fig 4

Flow diagram of the study population.

Table 2

Clinical characteristics of enrolled subjects.

Variable | Total N | Total value | Training set N | Training set value | Validation set N | Validation set value | p
Demographics
 Age (years), mean ± SD | 49,450 | 49.9 ± 9.3 | 24,726 | 49.9 ± 9.4 | 24,724 | 49.8 ± 9.3 | 0.160
 Sex | 49,450 | | 24,726 | | 24,724 | | 0.008
  Female, n (%) | | 21,762 (44.0) | | 10,735 (43.4) | | 11,027 (44.6) |
  Male, n (%) | | 27,688 (56.0) | | 13,991 (56.6) | | 13,697 (55.4) |
 Family history of colorectal cancer | 45,583 | | 22,759 | | 21,706 | | 0.694
  Yes, n (%) | | 2,251 (4.9) | | 1,133 (5.0) | | 1,118 (4.9) |
  No, n (%) | | 43,332 (95.1) | | 21,626 (95.0) | | 21,706 (95.1) |
Physical measurement
 Body mass index, mean ± SD | 44,581 | 23.7 ± 3.1 | 22,275 | 23.7 ± 3.1 | 22,306 | 23.6 ± 3.1 | 0.410
 Waist circumference (cm), mean ± SD | 44,145 | 83.4 ± 44.3 | 22,057 | 83.7 ± 62.0 | 22,088 | 83.2 ± 9.1 | 0.229
 Body fat percentage* (%), mean ± SD | 49,058 | 25.5 ± 6.6 | 24,517 | 25.4 ± 6.5 | 24,541 | 25.5 ± 6.6 | 0.813
Cigarette smoking
 Smoking status | 42,579 | | 21,271 | | 21,308 | | 0.319
  Non-smoker, n (%) | | 23,841 (56.0) | | 11,838 (55.7) | | 12,003 (56.4) |
  Ex-smoker, n (%) | | 5,260 (12.3) | | 2,631 (12.3) | | 2,629 (12.3) |
  Current smoker, n (%) | | 13,478 (31.7) | | 6,802 (32.0) | | 6,676 (31.3) |
 Smoking duration (year), mean ± SD | 43,108 | 9.9 ± 12.8 | 21,526 | 10.0 ± 12.9 | 21,582 | 9.8 ± 12.7 | 0.143
 Smoking amount (pack/day), mean ± SD | 43,107 | 0.9 ± 1.1 | 21,534 | 0.9 ± 1.1 | 21,573 | 0.9 ± 1.1 | 0.238
Alcohol drinking
 Regular alcohol drinking | 43,777 | | 21,868 | | 21,909 | | 0.719
  Yes, n (%) | | 19,814 (45.3) | | 9,879 (45.2) | | 9,935 (45.4) |
  No, n (%) | | 23,963 (54.7) | | 11,989 (54.8) | | 11,974 (54.6) |
 Drinking duration (year), mean ± SD | 26,395 | 23.8 ± 10.1 | 13,250 | 23.8 ± 10.2 | 13,145 | 23.8 ± 10.0 | 0.551
 Drinking frequency (/week), mean ± SD | 40,171 | | 20,096 | | 20,075 | |
  No drinking, n (%) | | 19,814 (49.3) | | 9,879 (49.2) | | 9,935 (49.5) | 0.843
  Once a week, n (%) | | 3,268 (8.1) | | 1,671 (8.3) | | 1,597 (8.0) |
  2–3 times per month, n (%) | | 5,792 (14.4) | | 2,897 (14.4) | | 2,895 (14.4) |
  1–2 times per week, n (%) | | 6,730 (16.8) | | 3,344 (16.6) | | 3,386 (16.9) |
  3–4 times per week, n (%) | | 3,436 (8.6) | | 1,737 (8.6) | | 1,699 (8.5) |
  5–6 times per week, n (%) | | 779 (1.9) | | 411 (2.0) | | 368 (1.8) |
  Everyday, n (%) | | 352 (0.9) | | 157 (0.8) | | 195 (1.0) |
 Drinking amount at one (bottle), mean ± SD | 40,027 | 1.2 ± 1.4 | 20,028 | 1.2 ± 1.4 | 19,999 | 1.2 ± 1.4 | 0.782
Physical activity
 Type of physical activities | 29,150 | | 14,558 | | 14,592 | | 0.132
  Strenuous activities, n (%) | | 2,148 (7.4) | | 1,097 (7.5) | | 1,051 (7.2) |
  Moderate activities, n (%) | | 6,719 (23.1) | | 3,299 (22.7) | | 3,420 (23.4) |
  Mild activities, n (%) | | 17,895 (61.4) | | 8,930 (61.3) | | 8,965 (61.4) |
  None, n (%) | | 2,388 (8.2) | | 1,232 (8.5) | | 1,156 (7.9) |
 Physical activity frequency (/week), mean ± SD | 28,510 | 2.8 ± 0.9 | 14,267 | 2.79 ± 0.88 | 14,243 | 2.80 ± 0.87 | 0.121
 Physical activity duration (minutes), mean ± SD | 28,663 | 36.8 ± 11.6 | 14,337 | 36.74 ± 11.62 | 14,326 | 36.82 ± 11.49 | 0.562
Co-morbidity
 Hypertension, n (%) | | 6,545 (13.2) | | 3,325 (13.5) | | 3,220 (13.0) | 0.166
 Diabetes, n (%) | | 1,917 (3.9) | | 932 (3.8) | | 985 (5.0) | 0.216
 Hyperlipidemia, n (%) | | 1,941 (3.9) | | 932 (3.8) | | 1,009 (4.1) | 0.355
Aspirin use
 Regular use, n (%) | | 2,612 (5.3) | | 1,336 (5.4) | | 1,276 (5.2) | 0.229
 No use, n (%) | | 46,838 (94.7) | | 23,390 (94.6) | | 23,448 (94.8) |
Laboratory measurement
 Hemoglobin, mean ± SD | 49,136 | 14.3 ± 31.5 | 24,556 | 14.3 ± 1.6 | 24,580 | 14.3 ± 1.5 | 0.601
 Hematocrit, mean ± SD | 49,136 | 42.4 ± 34.2 | 24,556 | 42.4 ± 4.2 | 24,580 | 42.4 ± 4.2 | 0.463
 Platelet, mean ± SD | 49,136 | 234.9 ± 52.3 | 24,556 | 234.7 ± 51.9 | 24,580 | 235.1 ± 52.8 | 0.421
 Prothrombin time (INR) | 46,820 | 1.0 ± 0.1 | 23,404 | 1.0 ± 0.1 | 23,416 | 1.0 ± 0.1 | 0.537
 Total protein | 49,137 | 7.1 ± 0.4 | 24,559 | 7.1 ± 0.4 | 24,578 | 7.1 ± 0.4 | 0.051
 Albumin | 49,137 | 4.5 ± 0.3 | 24,559 | 4.5 ± 0.3 | 24,578 | 4.5 ± 0.3 | 0.419
 Total bilirubin, mean ± SD | 49,137 | 0.9 ± 0.4 | 24,559 | 0.9 ± 0.4 | 24,578 | 0.9 ± 0.4 | 0.507
 Aspartate transaminase | 49,142 | 26.1 ± 16.1 | 24,561 | 26.2 ± 16.6 | 24,581 | 26.0 ± 15.6 | 0.233
 Alanine transaminase | 49,142 | 26.4 ± 24.6 | 24,561 | 26.5 ± 24.6 | 24,581 | 26.3 ± 24.5 | 0.292
 Alkaline phosphatase | 49,137 | 63.3 ± 18.3 | 24,558 | 63.6 ± 18.7 | 24,579 | 62.9 ± 17.9 | 0.001
 γ-glutamyltransferase, mean ± SD | 48,603 | 33.9 ± 44.8 | 24,295 | 34.1 ± 47.0 | 24,308 | 33.7 ± 42.6 | 0.268
 Uric acid, mean ± SD | 49,129 | 5.2 ± 1.4 | 24,554 | 5.2 ± 1.4 | 24,575 | 5.2 ± 1.4 | 0.001
 Blood urea nitrogen | 49,135 | 13.3 ± 3.4 | 24,559 | 13.3 ± 3.4 | 24,576 | 13.3 ± 3.4 | 0.499
 Creatinine | 49,138 | 0.9 ± 0.2 | 24,560 | 0.9 ± 0.2 | 24,578 | 0.9 ± 0.2 | 0.182
 Fasting glucose, mean ± SD | 49,146 | 93.7 ± 18.0 | 24,563 | 93.7 ± 17.7 | 24,583 | 93.8 ± 18.3 | 0.625
 Hemoglobin A1c, mean ± SD | 47,575 | 5.6 ± 0.7 | 23,790 | 5.6 ± 0.7 | 23,785 | 5.6 ± 0.7 | 0.916
 Insulin | 34,601 | 7.4 ± 4.4 | 17,337 | 7.4 ± 4.5 | 17,264 | 7.3 ± 4.3 | 0.206
 C-peptide | 34,602 | 1.7 ± 0.8 | 17,337 | 1.7 ± 0.8 | 17,265 | 1.7 ± 0.8 | 0.174
 Total cholesterol, mean ± SD | 49,153 | 196.5 ± 34.7 | 24,569 | 196.7 ± 34.8 | 24,584 | 196.3 ± 34.5 | 0.194
 Triglyceride, mean ± SD | 48,757 | 119.0 ± 76.0 | 24,377 | 119.2 ± 75.6 | 24,380 | 118.7 ± 76.5 | 0.501
 HDL-cholesterol, mean ± SD | 48,755 | 55.9 ± 14.6 | 24,376 | 55.4 ± 14.7 | 24,379 | 55.6 ± 14.6 | 0.120
 LDL-cholesterol, mean ± SD | 48,759 | 123.9 ± 31.1 | 24,378 | 124.2 ± 31.2 | 24,381 | 123.7 ± 31.0 | 0.117
 C-reactive protein, mean ± SD | 43,613 | 0.1 ± 0.3 | 21,800 | 0.1 ± 0.3 | 21,813 | 0.1 ± 0.3 | 0.073
 Calcium, mean ± SD | 49,128 | 9.2 ± 0.4 | 24,554 | 9.2 ± 0.4 | 24,574 | 9.2 ± 0.4 | 0.755
 Ferritin, mean ± SD | 40,343 | 121.1 ± 120.2 | 20,211 | 122.4 ± 126.5 | 20,132 | 119.7 ± 113.6 | 0.089
Colonoscopic and pathologic finding of enrolled patients
 No adenoma | 34,734 | 70.2% | 17,377 | 70.3% | 17,357 | 70.2% | 0.855
 Serrated polyp | 5,868 | 11.9% | 2,916 | 11.8% | 2,952 | 11.9% | 0.614
 Any adenoma | 14,716 | 29.8% | 7,349 | 29.7% | 7,367 | 29.8% | 0.855
  Number of adenomas
   1 or 2 | 12,251 | 24.8% | 6,084 | 24.6% | 6,167 | 24.9% | 0.319
   ≥3 | 2,465 | 5.0% | 1,265 | 5.1% | 1,200 | 4.9% |
  Size of the largest adenoma
   ≤10 mm | 14,002 | 28.3% | 6,994 | 28.3% | 7,008 | 28.3% | 0.976
   >10 mm | 714 | 1.4% | 355 | 1.4% | 359 | 1.5% |
  Histology of adenoma
   Tubular adenoma | 14,586 | 29.5% | 7,279 | 29.4% | 7,307 | 29.6% | 0.659
   Tubulovillous or villous adenoma | 130 | 0.3% | 70 | 0.3% | 60 | 0.2% |
  Dysplasia grade
   Low-grade dysplasia | 14,511 | 29.3% | 7,251 | 29.3% | 7,260 | 29.4% | 0.928
   High-grade dysplasia | 125 | 0.3% | 59 | 0.2% | 66 | 0.1% |
 Non-advanced adenoma | 13,691 | 27.7% | 6,836 | 27.7% | 6,855 | 27.7% | 0.981
 Advanced adenoma§ | 989 | 2.0% | 495 | 2.0% | 494 | 2.0% | 0.975
 Invasive cancer | 92 | 0.2% | 45 | 0.2% | 47 | 0.2% | 0.834
 Advanced neoplasia | 1,025 | 2.1% | 513 | 2.1% | 512 | 2.1% | 0.976

*Measured by bioelectrical impedance device

†Type of physical activities performed during the last 7 days, including recreation, exercise, sports activities, and activities at work

- strenuous activities: ex) labor, aerobics, fast running bicycle, jogging, soccer
- moderate activities: ex) a quick step, swimming, mountain climbing, four-up tennis
- mild activities: ex) walking, golf, household chores
- none: I do not even walk for 10 m

§Advanced adenoma was defined as adenoma with villous histology, high-grade dysplasia, or size >10 mm

¶Advanced neoplasia refers to advanced adenoma and invasive cancer

Identifying risk predictors and developing a candidate risk prediction model

To identify the patients with advanced CRN among the individuals who underwent their first colonoscopy, a stepwise logistic regression using all available variables listed in Table 2 was conducted for the imputed training set. We identified age, gender, diabetes, aspirin use, smoking duration, alcohol drinking frequency, drinking duration, uric acid, and γ-glutamyltransferase as the potential predictors (Table 3). Predictors for advanced CRN were refined using the complete data from the training set and excluded drinking duration and uric acid due to P-values > 0.3. Finally, age, gender, smoking duration, alcohol drinking frequency, aspirin use, and γ-glutamyltransferase were included in the prediction model (model 1). The prediction score from the refined prediction model 1 was determined by the following equation, written here with the model 1 coefficients from Table 3: prediction score = −8.428 + 0.074 × age + 0.190 × gender + 0.015 × smoking duration + 0.094 × drinking frequency − 0.286 × aspirin use + 0.002 × γ-glutamyltransferase.
Table 3

Stepwise logistic regression for predicting patients with advanced colorectal neoplasia among individuals who underwent their first colonoscopy.

1. Predictor selection for advanced neoplasia using the imputed training set
Parameter | Estimate | Standard Error | P
Intercept | -8.282 | 0.394 | < .001
Uric acid | 0.062 | 0.039 | .110
γ-Glutamyltransferase | 0.001 | 0.001 | .035
Smoking duration | 0.015 | 0.004 | < .001
Drinking duration | 0.010 | 0.007 | .131
Drinking frequency | 0.082 | 0.030 | .007
Aspirin use | -0.299 | 0.096 | .002
Diabetes | 0.225 | 0.110 | .041
Gender | 0.122 | 0.069 | .075
Age | 0.065 | 0.006 | < .001

2. Predictor refining for advanced neoplasia using the complete training set: variables with a P-value > 0.3 were excluded
Parameter | Estimate | Standard Error | P
Intercept | -8.720 | 0.515 | < .001
Uric acid | 0.073 | 0.049 | .140
γ-Glutamyltransferase | 0.001 | 0.001 | .026
Smoking duration | 0.015 | 0.005 | .002
Drinking duration | 0.005 | 0.009 | .538
Drinking frequency | 0.089 | 0.035 | .011
Aspirin use | -0.192 | 0.111 | .082
Gender | 0.138 | 0.123 | .261
Age | 0.071 | 0.009 | < .001

Parameter | Estimate | Standard Error | P
Intercept | -8.710 | 0.432 | < .001
Uric acid | 0.050 | 0.044 | .253
γ-Glutamyltransferase | 0.001 | 0.001 | .024
Smoking duration | 0.015 | 0.004 | .001
Drinking frequency | 0.095 | 0.031 | .002
Aspirin use | -0.288 | 0.104 | .006
Gender | 0.153 | 0.083 | .064
Age | 0.074 | 0.006 | < .001

3. Prediction model for advanced neoplasia (Model 1)
Parameter | Estimate | Standard Error | P
Intercept | -8.428 | 0.354 | < .001
γ-Glutamyltransferase | 0.002 | 0.001 | .016
Smoking duration | 0.015 | 0.004 | .001
Drinking frequency | 0.094 | 0.031 | .002
Aspirin use | -0.286 | 0.104 | .006
Gender | 0.190 | 0.076 | .012
Age | 0.074 | 0.006 | < .001

Parameter | Estimate | Standard Error | P
Intercept | -8.390 | 0.350 | < .001
Smoking duration | 0.015 | 0.004 | < .001
Drinking frequency | 0.100 | 0.031 | .001
Aspirin use | -0.289 | 0.104 | .006
Gender | 0.205 | 0.075 | .007
Age | 0.074 | 0.006 | < .001

Among the identified predictors, γ-glutamyltransferase was the only laboratory parameter, and it requires blood sampling and incurs laboratory costs. When γ-glutamyltransferase was removed from the prediction model, all predictors could be obtained from a simple questionnaire, and a simple 5-item risk index could be readily determined from the questionnaire data. The final prediction model was constructed with age, gender, smoking duration, alcohol drinking frequency, and aspirin use (model 2). The prediction score from the refined prediction model 2 was determined by the following equation, written here with the final set of coefficients in Table 3 (the model without γ-glutamyltransferase): prediction score = −8.390 + 0.074 × age + 0.205 × gender + 0.015 × smoking duration + 0.100 × drinking frequency − 0.289 × aspirin use.
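
As an illustration, the questionnaire-based score can be computed with the final set of coefficients in Table 3 (the model without γ-glutamyltransferase). The sketch below rests on several assumptions that the text does not state: gender is coded 1 for male and 0 for female, aspirin use is coded 1 for regular use, drinking frequency is entered as a numeric level, and a score exactly equal to the cut-off is treated as low risk. The cut-off of −4.195 is the value reported in the calibration analysis.

```python
import math

# Coefficients of the questionnaire-only model, as listed in Table 3.
MODEL2 = {
    "intercept": -8.390,
    "age": 0.074,
    "gender": 0.205,              # assumed coding: 1 = male, 0 = female
    "smoking_duration": 0.015,    # years
    "drinking_frequency": 0.100,  # assumed numeric level; coding not stated
    "aspirin_use": -0.289,        # assumed coding: 1 = regular use, 0 = no use
}
CUTOFF = -4.195  # cut-off between the sixth and seventh deciles


def risk_score(age, male, smoking_years, drinking_frequency, aspirin_use):
    """Linear predictor (log-odds) of advanced CRN."""
    return (
        MODEL2["intercept"]
        + MODEL2["age"] * age
        + MODEL2["gender"] * male
        + MODEL2["smoking_duration"] * smoking_years
        + MODEL2["drinking_frequency"] * drinking_frequency
        + MODEL2["aspirin_use"] * aspirin_use
    )


def predicted_probability(score):
    """Logistic transform of the score."""
    return 1.0 / (1.0 + math.exp(-score))


def risk_group(score, cutoff=CUTOFF):
    """Assumption: scores above the cut-off are high risk."""
    return "high" if score > cutoff else "low"


score = risk_score(age=60, male=1, smoking_years=20, drinking_frequency=2, aspirin_use=0)
print(round(score, 3), risk_group(score), round(predicted_probability(score), 4))
```

On the logit scale, the cut-off of −4.195 corresponds to a predicted probability of roughly 1.5%.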

Evaluating the performance of the prediction model

Discrimination refers to the ability to separate subjects with events from those without events. Using prediction models 1 and 2, AUC values were calculated and used to evaluate the discrimination power of the prediction models. The AUC for prediction model 1 was 0.716 for the training set and 0.701 for the validation set (Fig 5A), whereas the AUC for prediction model 2 was 0.726 for the training set and 0.713 for the validation set (Fig 5B). Model 2 showed slightly higher discriminatory ability than model 1, even though one predictor had been removed. The reason why model 2 was superior to model 1 was that the number of participants included in the calculation was larger in model 2 (training set: n = 18,874, validation set: n = 19,199) than model 1 (training set: n = 18,900, validation set: n = 19,277).
Fig 5

Model performance.

Area under the receiver operating curve (AUC) was calculated to evaluate the discrimination power between the training set (line) and validation set (dot) in prediction model 1 (A) and model 2 (B).

Prediction model 2 was selected as the final prediction model for advanced CRN. Calibration is a measure of how accurately the predicted probabilities of advanced CRN inferred from the training set match the subsequently observed event rate in the validation set. The individuals included in the training set were divided into deciles according to predicted risk for advanced CRN. Then, the predicted rate of the training set and observed rates of the validation set in each category were compared (Fig 6), indicating good calibration performance. To improve clinical utilization, cut-off values were set at the point of discrimination between the high- and low-risk group for advanced CRN in simulated calibration charts. Between the sixth and seventh deciles, the risk of advanced CRN increased from 1.51% to 2.45% in the training set and 1.50% to 2.45% in the validation set. The cut-off value of -4.195 was set at this point between the sixth and seventh deciles (Table 4).
Fig 6

Model calibration.

Cut-off values to discriminate between the high- and low-risk groups for advanced colorectal neoplasia were set at the point between the sixth and seventh deciles based on the risk of advanced colorectal neoplasia.

Table 4

Model calibration and estimation of cut-off value for discrimination between high- and low-risk for advanced colorectal neoplasia (CRN).

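Decile of predicted risk | Training set N | Training set prevalence of advanced CRN (%) | Validation set N | Validation set prevalence of advanced CRN (%) | Risk group (pooled prevalence in validation set, %)
1 | 1922 | 0.260 | 1901 | 0.316 | Low-risk group (1.067)
2 | 1922 | 0.520 | 2064 | 0.581 |
3 | 1922 | 0.937 | 1983 | 1.261 |
4 | 1922 | 1.145 | 2079 | 1.058 |
5 | 1922 | 1.197 | 1715 | 1.808 |
6 | 1922 | 1.509 | 1871 | 1.497 |
7 | 1922 | 2.445 | 1960 | 2.449 | High-risk group (3.955)
8 | 1922 | 2.653 | 1867 | 3.267 |
9 | 1922 | 4.214 | 1873 | 3.951 |
10 | 1929 | 5.962 | 1885 | 6.207 |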
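
The pooled prevalence of each risk group can be checked against the validation-set decile rows of Table 4 with a short Python sketch. Case counts are not listed in the table, so they are reconstructed here by rounding N × prevalence; that reconstruction and the helper name `pooled_prevalence` are assumptions of this illustration.

```python
# Validation-set rows of Table 4: (decile, N, prevalence of advanced CRN in %).
VALIDATION = [
    (1, 1901, 0.316), (2, 2064, 0.581), (3, 1983, 1.261), (4, 2079, 1.058),
    (5, 1715, 1.808), (6, 1871, 1.497), (7, 1960, 2.449), (8, 1867, 3.267),
    (9, 1873, 3.951), (10, 1885, 6.207),
]


def pooled_prevalence(rows):
    """Pooled prevalence (%) over several deciles, from reconstructed case counts."""
    total_n = sum(n for _, n, _ in rows)
    cases = sum(round(n * pct / 100) for _, n, pct in rows)
    return 100 * cases / total_n


low = pooled_prevalence(VALIDATION[:6])   # deciles 1-6, below the cut-off
high = pooled_prevalence(VALIDATION[6:])  # deciles 7-10, above the cut-off
print(round(low, 2), round(high, 2), round(high / low, 1))
```

Under these assumptions the low- and high-risk groups come out at about 1.07% and 3.96%, a ratio of about 3.7.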

Discrimination of the low-risk group from the high-risk group for advanced CRN

Based on the cut-off value, a simplified prediction model for discrimination of the low-risk group from the high-risk group for advanced CRN was constructed (Table 5). The high-risk group had a 3.7-fold increased risk of advanced CRN compared to the low-risk group (4.0% vs. 1.1%, P < .001). In the training set, the sensitivity, specificity, accuracy, positive predictive value (PPV), and negative predictive value (NPV) of the simplified prediction model were 73.3%, 61.0%, 61.3%, 3.9%, and 99.1%, respectively. In the validation set, the sensitivity, specificity, accuracy, PPV, and NPV were 70.8%, 61.2%, 61.4%, 4.0%, and 98.9%, respectively.
Table 5

Discrimination ability of the low-risk group from the high-risk group for advanced colorectal neoplasia.

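| Dataset | Risk group | Advanced CRN (-), n | Advanced CRN (+), n | p | Sensitivity, % (95% CI) | Specificity, % (95% CI) | Accuracy, % (95% CI) | PPV, % (95% CI) | NPV, % (95% CI) |
|---|---|---|---|---|---|---|---|---|---|
| Training set | Low-risk group | 11491 | 107 | <.001 | 73.3 (69.0–77.6) | 61.0 (60.3–61.7) | 61.3 (60.6–62.0) | 3.9 (3.4–4.3) | 99.1 (98.9–99.3) |
| Training set | High-risk group | 7335 | 294 | | | | | | |
| Validation set | Low-risk group | 11487 | 124 | <.001 | 70.8 (66.4–75.1) | 61.2 (60.5–61.9) | 61.4 (60.7–62.1) | 4.0 (3.5–4.4) | 98.9 (98.7–99.1) |
| Validation set | High-risk group | 7288 | 300 | | | | | | |
| Total dataset | Low-risk group | 22978 | 231 | <.001 | 72.0 (68.9–75.1) | 61.1 (60.6–61.6) | 61.3 (60.9–61.8) | 3.9 (3.6–4.2) | 99.0 (98.9–99.1) |
| Total dataset | High-risk group | 14623 | 594 | | | | | | |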

CRN, colorectal neoplasia; PPV, positive predictive value; NPV, negative predictive value.
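
The point estimates in Table 5 follow directly from the four cell counts of each 2×2 table. As a minimal sketch (the class and field names are illustrative, not taken from the paper, and confidence intervals are not reproduced), the metrics can be recomputed as follows, treating "high-risk" as a positive test result:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    tn: int  # low-risk group, advanced CRN (-)
    fn: int  # low-risk group, advanced CRN (+)
    fp: int  # high-risk group, advanced CRN (-)
    tp: int  # high-risk group, advanced CRN (+)

    def metrics(self):
        """Return the five point estimates as percentages rounded to one decimal."""
        total = self.tn + self.fn + self.fp + self.tp
        raw = {
            "sensitivity": self.tp / (self.tp + self.fn),
            "specificity": self.tn / (self.tn + self.fp),
            "accuracy": (self.tp + self.tn) / total,
            "ppv": self.tp / (self.tp + self.fp),
            "npv": self.tn / (self.tn + self.fn),
        }
        return {name: round(100 * value, 1) for name, value in raw.items()}

    def risk_ratio(self):
        """Prevalence in the high-risk group divided by prevalence in the low-risk group."""
        high = self.tp / (self.tp + self.fp)
        low = self.fn / (self.tn + self.fn)
        return high / low


# Cell counts transcribed from Table 5.
training = TwoByTwo(tn=11491, fn=107, fp=7335, tp=294)
validation = TwoByTwo(tn=11487, fn=124, fp=7288, tp=300)
total = TwoByTwo(tn=22978, fn=231, fp=14623, tp=594)

for name, table in [("training", training), ("validation", validation), ("total", total)]:
    print(name, table.metrics())
print("validation risk ratio:", round(validation.risk_ratio(), 1))
```

With the validation-set counts, the ratio of prevalences between the high- and low-risk groups comes out at about 3.7, matching the fold increase quoted in the text.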

Comparison of the discrimination performance of the final model with previously published prediction models for advanced CRN

In the validation set, the discrimination performance of the final model was compared with that of the advanced colorectal neoplasia (ACN) index [14] and the Asia-Pacific Colorectal Screening (APCS) score [12] using the AUC (Fig 7). The AUC of the final model was 0.716 (95% CI, 0.691–0.741), whereas that of the ACN index was 0.672 (95% CI, 0.645–0.699) and that of the APCS score was 0.678 (95% CI, 0.651–0.705). The discrimination performance of the developed model for high-risk patients with advanced CRN was better than that of both the ACN index (P < .001) and the APCS score (P < .001).
Fig 7

Comparison of the discrimination performance of the final model with previously published prediction models for advanced colorectal neoplasia.
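
For readers unfamiliar with the AUC used in this comparison, the sketch below computes it as the probability that a randomly chosen individual with advanced CRN receives a higher score than a randomly chosen individual without it, with ties counted as one half. The scores and labels are hypothetical toy values, not study data, and the function does not reproduce the significance test behind the reported P values, which the text does not specify.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = advanced CRN present, 0 = absent.
labels = [0, 0, 1, 0, 1, 1]
model_a_scores = [0.01, 0.04, 0.03, 0.015, 0.05, 0.06]
model_b_scores = [0.02, 0.05, 0.03, 0.04, 0.03, 0.06]

auc_a = auc(model_a_scores, labels)
auc_b = auc(model_b_scores, labels)
print(f"model A AUC = {auc_a:.3f}, model B AUC = {auc_b:.3f}")
```

The pairwise loop is quadratic in the number of individuals, which is fine for illustration; for a dataset of the size used here, a rank-based implementation would be the practical choice.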

Discussion

Big data can improve health by providing insights into public health, such as enhanced disease prediction and prevention. Using a big data analytics algorithm, we explored a large health screening examination database. The refined database, which combined structured and unstructured data, contained first screening colonoscopy and comprehensive health examination data from 49,450 patients. Big data can be applied not only to verify alleged associations but also as a hypothesis-generating machine [24]. In this study, we generated a prediction model for advanced CRN, which may be the first attempt to apply big data analytics in the field of gastroenterology. The final simplified prediction model showed acceptable discriminative power for patients with advanced CRN. Our simple risk score, which uses information readily available from the patient's clinical questionnaire, stratified asymptomatic patients into low- and high-risk groups for advanced CRN before a screening colonoscopy was performed. The discrimination performance of the developed model for high-risk patients with advanced CRN was better than that of existing models. Based on our results, screening colonoscopy appears inefficient for patients in the low-risk group, given the low probability of advanced CRN and the cost and risk associated with colonoscopy. The specificity of our prediction model was not sufficiently high (about 61%), but its NPV was approximately 99%. Because this study enrolled asymptomatic individuals who underwent health check-ups rather than symptomatic patients, our objective was to develop and validate a prediction model for estimating the probability of having advanced CRN. We hope to apply the proposed prediction model to identify patients who may not need to undergo a colonoscopy.
Many studies have reported different risk scoring systems for CRC; however, almost none of them has been translated into clinical practice. This is possibly because the fecal occult blood test is very convenient, its result is straightforward, and its cost is low. In Korea, a national CRC screening program using fecal immunochemical testing (FIT) is in place. The limitation of a stool-based test such as FIT is that it is a diagnostic tool only for the early detection of CRC. A recent guideline grouped CRC screening tests into cancer prevention and cancer detection tests [25]. The benefit of a cancer prevention test is that it can eliminate advanced CRN and thereby prevent CRC. Cancer prevention tests are preferred over detection tests. The goal of CRC screening has shifted from “screening detection to prevention by polypectomy [26].” As such, the present study aimed to develop and validate a prediction model for estimating the probability of having advanced CRN, not CRC. Therefore, we think it is difficult to directly compare a predictive model based on FIT with one based on colonoscopy. Developing a prediction model for advanced CRN is not novel, and several other models already exist. Our study implemented a predictive model using varied clinical variables acquired in real-world clinical practice. Our prediction model predicted advanced CRN more effectively than previously proposed advanced CRN prediction models. We chose to compare our prediction model with those of Schroy et al. [14] and Yeoh et al. [12] because both studies evaluated the predictability of advanced CRN and were well designed. The study by Imperiale et al. was also well designed [10], but its outcome measure was advanced proximal CRN. Therefore, we considered Imperiale’s study inappropriate for comparison with our model.
Our study was performed in a large population who underwent their first colonoscopy and a comprehensive health screening examination, which may minimize sampling error, represent real-world practice, and enhance the model's usefulness in facilitating shared decision-making for individuals who need CRC screening. The use of EMR systems among healthcare providers has spread widely over the past decade [15]. Using text from the EMR system, we applied NLP and the CETAS method to demonstrate the replicability of manual chart review. Previous studies have revealed the utility of NLP in extracting information from clinical text [20-22]. In addition, our risk prediction models use extensive independent variables to estimate the probability of having or developing advanced CRN. Therefore, the discrimination performance of our model for high-risk patients with advanced CRN was better than that of existing models. Our study had some limitations. External validation could not be performed, so there are concerns about overfitting and generalizability. In addition, the model was developed using a database of patients willing to undergo screening colonoscopy: the population was selected from 70,959 subjects who underwent colonoscopy screening and was selected once more when 21,509 subjects were excluded from the analysis, leaving 49,450. On that account, it is unclear whether our model applies to patients who are unable or unwilling to undergo colonoscopy. Our study population was quite young for routine screening colonoscopy. The mean age of the study population was 50 years, which may explain the overall advanced CRN rate of 2.3% in this study. However, all included subjects underwent colonoscopy as part of their health check-up. Thus, even though the patients were young, they did not have symptoms or a family history of CRC.
Furthermore, given the long time needed for an adenoma to progress to a carcinoma, the increased number of cases of CRC diagnosed in this age group may originate from adenomas present in individuals in their 40s or earlier [17]. These cancers may be prevented by colonoscopy with polypectomy of premalignant lesions in the preceding decade. Despite this theoretical argument for screening individuals in their 40s or earlier, we included patients who underwent colonoscopy at any age and analyzed age as a continuous variable to develop the prediction model. In addition, we used mean substitution as the imputation technique to deal with missing predictor values in the training set. Mean substitution has the benefit of not changing the sample mean for that variable; however, it attenuates any correlations involving the imputed variables. Mean imputation has some attractive properties for univariate analysis but becomes problematic for multivariate analysis. Although we used the mean-substituted dataset only during univariate logistic regression for the identification of predictors, and not during multivariate logistic regression, the uncertainty in the imputation can lead to overly precise results and errors in our prediction model [27, 28]. Despite these weak points, our model can serve as a clinically useful tool for facilitating shared decision-making about selecting screening modalities for the early detection and prevention of CRC, especially when provider and patient preferences differ. If physicians could predict which patients are at increased risk before colonoscopy, they might make better decisions about screening. We developed a simple risk scoring model that is easily obtained by questionnaire and that precisely identified low- and high-risk groups for advanced CRN at the first screening colonoscopy.
This model may increase CRC risk awareness and help healthcare providers encourage the high-risk group to undergo colonoscopy. Furthermore, by identifying the patients with a high risk of advanced CRN, the present model may help to target primary prevention interventions. Once it has been externally validated, the model will be useful to facilitate more effective shared decision-making for CRC screening.
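
The mean-substitution limitation noted in the Discussion can be illustrated with a small, deliberately contrived example. The values below are hypothetical and are not study data; the helper names are illustrative. Replacing missing predictor values with the observed mean leaves the sample mean unchanged but weakens the correlation between the predictor and the outcome.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mean_substitute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical data: the outcome is exactly twice the predictor,
# and the first and last predictor values are missing.
outcome = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
predictor = [None, 2, 3, 4, 5, 6, 7, 8, 9, None]

observed_pairs = [(x, y) for x, y in zip(predictor, outcome) if x is not None]
r_complete = pearson([x for x, _ in observed_pairs], [y for _, y in observed_pairs])

imputed = mean_substitute(predictor)
r_imputed = pearson(imputed, outcome)

print(f"observed mean = {mean([x for x, _ in observed_pairs])}, imputed mean = {mean(imputed)}")
print(f"correlation, complete cases = {r_complete:.3f}; after mean substitution = {r_imputed:.3f}")
```

In this toy case the complete-case correlation is perfect, whereas the mean-substituted series shows a visibly weaker correlation, because the imputed points sit at the center of the predictor distribution regardless of their outcome values.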
  28 in total

Review 1.  The Paris endoscopic classification of superficial neoplastic lesions: esophagus, stomach, and colon: November 30 to December 1, 2002.

Authors: 
Journal:  Gastrointest Endosc       Date:  2003-12       Impact factor: 9.427

Review 2.  Handling missing data in self-report measures.

Authors:  Susan M Fox-Wasylyshyn; Maher M El-Masri
Journal:  Res Nurs Health       Date:  2005-12       Impact factor: 2.228

3.  Association between markers of glucose metabolism and risk of colorectal adenoma.

Authors:  Sanjay Rampal; Moon Hee Yang; Jidong Sung; Hee Jung Son; Yoon-Ho Choi; Jun Haeng Lee; Young-Ho Kim; Dong Kyung Chang; Poong-Lyul Rhee; Jong Chul Rhee; Eliseo Guallar; Juhee Cho
Journal:  Gastroenterology       Date:  2014-03-14       Impact factor: 22.682

4.  [Korean guidelines for colorectal cancer screening and polyp detection].

Authors:  Bo In Lee; Sung Pil Hong; Seong-Eun Kim; Se Hyung Kim; Hyun-Soo Kim; Sung Noh Hong; Dong-Hoon Yang; Sung Jae Shin; Suck-Ho Lee; Young-Ho Kim; Dong Il Park; Hyun Jung Kim; Suk-Kyun Yang; Hyo Jong Kim; Hae Jeong Jeon
Journal:  Korean J Gastroenterol       Date:  2012-02

5.  The Asia-Pacific Colorectal Screening score: a validated tool that stratifies risk for colorectal advanced neoplasia in asymptomatic Asian subjects.

Authors:  Khay-Guan Yeoh; Khek-Yu Ho; Han-Mo Chiu; Feng Zhu; Jessica Y L Ching; Deng-Chyang Wu; Takahisa Matsuda; Jeong-Sik Byeon; Sang-Kil Lee; Khean-Lee Goh; Jose Sollano; Rungsun Rerknimitr; Rupert Leong; Kelvin Tsoi; Jaw-Town Lin; Joseph J Y Sung
Journal:  Gut       Date:  2011-03-14       Impact factor: 23.059

6.  Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012.

Authors:  Jacques Ferlay; Isabelle Soerjomataram; Rajesh Dikshit; Sultan Eser; Colin Mathers; Marise Rebelo; Donald Maxwell Parkin; David Forman; Freddie Bray
Journal:  Int J Cancer       Date:  2014-10-09       Impact factor: 7.396

7.  Cumulative risk of colon cancer up to age 70 years by risk factor status using data from the Nurses' Health Study.

Authors:  Esther K Wei; Graham A Colditz; Edward L Giovannucci; Charles S Fuchs; Bernard A Rosner
Journal:  Am J Epidemiol       Date:  2009-09-01       Impact factor: 4.897

8.  Development of a risk score for colorectal cancer in men.

Authors:  Jane A Driver; J Michael Gaziano; Rebecca P Gelber; I-Min Lee; Julie E Buring; Tobias Kurth
Journal:  Am J Med       Date:  2007-03       Impact factor: 4.965

9.  A score to estimate the likelihood of detecting advanced colorectal neoplasia at colonoscopy.

Authors:  Michal F Kaminski; Marcin Polkowski; Ewa Kraszewska; Maciej Rupinski; Eugeniusz Butruk; Jaroslaw Regula
Journal:  Gut       Date:  2014-01-02       Impact factor: 23.059

10.  Risk prediction model for colorectal cancer: National Health Insurance Corporation study, Korea.

Authors:  Aesun Shin; Jungnam Joo; Hye-Ryung Yang; Jeongin Bak; Yunjin Park; Jeongseon Kim; Jae Hwan Oh; Byung-Ho Nam
Journal:  PLoS One       Date:  2014-02-12       Impact factor: 3.240

