Hyun Wook Han1,2, Sun Young Yang3, Jung Ho Bae1,2,3, Gyuseon Song1,2, Soonok Sa1,2, Goh Eun Chung3, Ji Yeon Seo3, Eun Hyo Jin3, Heecheon Kim4, DongUk An4.
Abstract
BACKGROUND: Manual data extraction of colonoscopy quality indicators is time- and labor-intensive. Natural language processing (NLP), a computer-based linguistics technique, can automate the extraction of important clinical information, such as adverse events, from unstructured free-text reports. NLP information extraction can facilitate the optimization of clinical work by helping to improve quality control and patient management.
Keywords: adenoma; colonoscopy; endoscopy; natural language processing
Year: 2022 PMID: 35436226 PMCID: PMC9055472 DOI: 10.2196/35257
Source DB: PubMed Journal: JMIR Med Inform
Figure 1. Data set description and process for the NLP pipeline development and information extraction. NLP: natural language processing.
Figure 2. Extraction and summarization process of the NLP pipeline. NLP: natural language processing; Y/N: yes/no (indicating presence or absence); Rt: right colon; Lt: left colon.
Characteristics of training and testing data sets for the development of the natural language processing pipeline.
| Characteristics | Training (N=2000) | Testing (N=1000) | P value |
| Age, mean (SD) | 58.6 (6.4) | 60.4 (6.5) | <.001 |
| Sex | | | .86 |
| Male, n (%) | 1188 (59.4) | 590 (59.0) | |
| Female, n (%) | 812 (40.6) | 410 (41.0) | |
| Conventional adenoma | | | |
| Overall, n (%) | 925 (46.2) | 475 (47.5) | .72 |
| Right colon only, n (%) | 501 (25.0) | 265 (26.5) | .54 |
| Left colon only, n (%) | 212 (10.6) | 113 (11.3) | .65 |
| Both, n (%) | 212 (10.6) | 97 (9.7) | .53 |
| Advanced adenoma^a | | | |
| Overall, n (%) | 77 (3.8) | 34 (3.4) | .62 |
| Right colon only, n (%) | 51 (2.6) | 14 (1.4) | .06 |
| Left colon only, n (%) | 24 (1.2) | 18 (1.8) | .26 |
| Both, n (%) | 3 (0.2) | 2 (0.2) | .87 |
| Sessile serrated lesion | | | |
| Overall, n (%) | 121 (6) | 66 (6.6) | .64 |
| Right colon only, n (%) | 79 (4) | 45 (4.5) | .56 |
| Left colon only, n (%) | 34 (1.7) | 15 (1.5) | .80 |
| Both, n (%) | 8 (0.4) | 6 (0.6) | .64 |
| Advanced sessile serrated lesion^b | | | |
| Overall, n (%) | 19 (1) | 12 (1.2) | .66 |
| Right colon only, n (%) | 14 (0.7) | 10 (1) | .52 |
| Left colon only, n (%) | 4 (0.2) | 1 (0.1) | .88 |
| Both, n (%) | 1 (0.1) | 1 (0.1) | .80 |
| | | | |
| Overall, n (%) | 3 (0.2) | 0 (0) | .54 |
| Right colon only, n (%) | 0 (0) | 0 (0) | |
| Left colon only, n (%) | 3 (0.2) | 0 (0) | .54 |
| Both, n (%) | 0 (0) | 0 (0) | |
^a Advanced adenomas were defined as adenomas ≥1 cm in size or with pathological features such as high-grade dysplasia or villous features.
^b Advanced sessile serrated lesions were defined as lesions ≥1 cm in size or with pathological features such as low- or high-grade dysplasia.
Performance of the natural language processing pipeline in the testing data set (N=1000).
| Indicators | Recall | Precision | Accuracy | F1 score |
| Presence of a conventional adenoma | 0.99 | 1.00 | 0.99 | 0.99 |
| Location | | | | |
| None | 1.00 | 0.98 | 0.99 | 0.99 |
| Right colon only | 0.98 | 1.00 | 0.99 | 0.99 |
| Left colon only | 0.98 | 0.99 | 0.99 | 0.99 |
| Both | 0.99 | 0.97 | 0.99 | 0.98 |
| Presence of an advanced adenoma^a | 1.00 | 0.97 | 0.99 | 0.99 |
| Location | | | | |
| None | 0.99 | 1.00 | 0.99 | 0.99 |
| Right colon only | 1.00 | 0.93 | 0.99 | 0.97 |
| Left colon only | 1.00 | 1.00 | 1.00 | 1.00 |
| Both | 1.00 | 1.00 | 1.00 | 1.00 |
| Presence of an SSL^b | 0.98 | 1.00 | 0.99 | 0.99 |
| Location | | | | |
| None | 1.00 | 0.99 | 0.99 | 0.99 |
| Right colon only | 0.96 | 1.00 | 0.99 | 0.98 |
| Left colon only | 1.00 | 1.00 | 1.00 | 1.00 |
| Both | 1.00 | 0.86 | 0.99 | 0.92 |
| Presence of an advanced SSL^c | 1.00 | 1.00 | 1.00 | 1.00 |
| Location | | | | |
| None | 1.00 | 1.00 | 1.00 | 1.00 |
| Right colon only | 0.90 | 1.00 | 0.99 | 0.95 |
| Left colon only | 1.00 | 1.00 | 1.00 | 1.00 |
| Both | 1.00 | 0.50 | 0.99 | 0.67 |
| | | | | |
| 0 | 1.00 | 0.99 | 1.00 | 0.99 |
| 1-2 | 0.99 | 0.99 | 0.99 | 0.99 |
| 3-4 | 0.98 | 1.00 | 0.98 | 0.99 |
| 5-10 | 1.00 | 1.00 | 1.00 | 1.00 |
| >10 | N/A^d | N/A | N/A | N/A |
| | | | | |
| 0 | 1.00 | 0.99 | 1.00 | 0.99 |
| 1-2 | 0.98 | 1.00 | 0.98 | 0.99 |
| 3-4 | 1.00 | 1.00 | 1.00 | 1.00 |
| 5-10 | N/A | N/A | N/A | N/A |
^a Advanced adenomas were defined as adenomas ≥1 cm in size or with pathological features such as high-grade dysplasia or villous features.
^b SSL: sessile serrated lesion.
^c Advanced sessile serrated lesions were defined as lesions ≥1 cm in size or with pathological features such as low- or high-grade dysplasia.
^d N/A: not applicable.
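The four metrics in the table above all follow from a confusion matrix for each indicator. The snippet below is a minimal Python sketch of those definitions, not the authors' evaluation code. The `Confusion` class, the function names, and the counts in `example` are hypothetical; the counts were chosen only so that recall and precision equal the 1.00 and 0.50 reported for the "Both" row under advanced SSL.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Confusion:
    """Confusion-matrix counts for one binary indicator (hypothetical container)."""
    tp: int  # pipeline says present, reference says present
    fp: int  # pipeline says present, reference says absent
    fn: int  # pipeline says absent, reference says present
    tn: int  # pipeline says absent, reference says absent


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # A ratio is undefined when its denominator is zero.
    return numerator / denominator if denominator else None


def recall(c: Confusion) -> Optional[float]:
    return _ratio(c.tp, c.tp + c.fn)


def precision(c: Confusion) -> Optional[float]:
    return _ratio(c.tp, c.tp + c.fp)


def accuracy(c: Confusion) -> Optional[float]:
    return _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn)


def f1_score(c: Confusion) -> Optional[float]:
    p, r = precision(c), recall(c)
    if p is None or r is None or p + r == 0:
        return None
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)


# Illustrative counts only (not taken from the study).
example = Confusion(tp=1, fp=1, fn=0, tn=998)
print(recall(example), precision(example), round(f1_score(example), 2))
```

With these counts the F1 score is 2 × 0.50 × 1.00 / 1.50 ≈ 0.67, which is the value reported in that row. In this illustration a single false positive against a single true case is enough to halve precision while accuracy stays near 1.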
Comparison of polyp detection rate and surveillance interval group assignment as assessed by manual review and the natural language processing pipeline in the test data set (N=1000).
| Extracted indicators | Annotator A | Annotator B | Annotator C | Annotator D | Annotator E | Manual review^b | NLP system | Gold standard^c | P value^a |
| Polyp detection rate | | | | | | | | | |
| ADR^d | 467 | 474 | 474 | 475 | 468 | 472 | 468 | 475 | .92 |
| SDR^e | 65 | 64 | 66 | 64 | 64 | 65 | 64 | 66 | .99 |
| Surveillance interval | | | | | | | | | |
| 1 year | N/A^f | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| 3 years | 59 | 58 | 60 | 62 | 58 | 59 | 63 | 63 | .92 |
| 3-5 years | 62 | 67 | 64 | 63 | 68 | 65 | 68 | 69 | .92 |
| 5-10 years | 40 | 40 | 40 | 40 | 40 | 40 | 39 | 40 | .99 |
| 7-10 years | 339 | 347 | 345 | 345 | 346 | 344 | 343 | 347 | .99 |
| 10 years | 479 | 480 | 481 | 480 | 480 | 480 | 480 | 481 | .99 |
^a P values were calculated using the 2×3 chi-square test.
^b Mean of the judgments made by the 5 human annotators.
^c Consensus judgment of the 5 human annotators; applied in inconsistent cases.
^d ADR: adenoma detection rate.
^e SDR: sessile serrated lesion detection rate.
^f N/A: not applicable (no patients were assigned a 1-year surveillance interval).
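The footnote above states that P values came from a 2×3 chi-square test. The following Python sketch shows how such a test can be computed; it is an illustration, not the authors' analysis. It assumes the 2 rows are positive and negative reports and the 3 columns are manual review, the NLP system, and the gold standard, each covering 1000 reports. That layout and the function name `chi_square_2x3` are assumptions, so the output need not equal the published P values. A 2×3 table has (2−1)(3−1) = 2 degrees of freedom, and for 2 degrees of freedom the chi-square upper-tail probability has the closed form exp(−x/2), so no statistics library is needed.

```python
import math
from typing import Sequence, Tuple


def chi_square_2x3(positives: Sequence[int], total: int) -> Tuple[float, float]:
    """Pearson chi-square test for a 2x3 table.

    Rows are positive/negative counts; columns are three methods that
    each classified `total` reports (assumed layout, see lead-in).
    Returns (statistic, p_value).
    """
    if len(positives) != 3:
        raise ValueError("expected exactly three columns")
    observed = [list(positives), [total - p for p in positives]]
    grand_total = 3 * total
    row_sums = [sum(row) for row in observed]
    if 0 in row_sums:
        raise ValueError("a row with no counts makes the test undefined")

    statistic = 0.0
    for i, row in enumerate(observed):
        for count in row:
            # Every column sums to `total`, so expected = row_sum * total / grand_total.
            expected = row_sums[i] * total / grand_total
            statistic += (count - expected) ** 2 / expected

    # Degrees of freedom = (2 - 1) * (3 - 1) = 2, where the survival
    # function of the chi-square distribution is exactly exp(-x / 2).
    p_value = math.exp(-statistic / 2)
    return statistic, p_value


# ADR counts for manual review, NLP system, and gold standard from the table.
adr_stat, adr_p = chi_square_2x3([472, 468, 475], total=1000)
print(f"chi-square = {adr_stat:.4f}, P = {adr_p:.2f}")
```

A large P value here means the three sets of counts are statistically indistinguishable under this layout, which is the direction of the comparison reported in the table.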
Clinical application of the natural language processing pipeline to nonannotated colonoscopy data created by 25 endoscopists between 2010 and 2019.
| Endoscopist | Procedures | Adenoma detection rate, n (%) | Advanced adenoma detection rate, n (%) | Sessile serrated lesion detection rate, n (%) | Advanced sessile serrated lesion detection rate, n (%) | Mean surveillance interval, years |
| A | 3060 | 1112 (36.3) | 94 (3.1) | 58 (1.9) | 8 (0.3) | 8.9 |
| B | 981 | 343 (35) | 36 (3.7) | 8 (0.8) | 0 (0) | 9.0 |
| C | 3553 | 1447 (40.7) | 129 (3.6) | 91 (2.6) | 21 (0.6) | 8.8 |
| D | 2765 | 1109 (40.1) | 92 (3.3) | 83 (3) | 17 (0.6) | 8.8 |
| E | 1174 | 469 (39.9) | 46 (3.9) | 18 (1.5) | 3 (0.3) | 8.9 |
| F | 1258 | 338 (26.9) | 39 (3.1) | 21 (1.7) | 1 (0.1) | 9.2 |
| G | 679 | 301 (44.3) | 12 (1.8) | 40 (5.9) | 11 (1.6) | 8.6 |
| H | 1165 | 505 (43.3) | 83 (7.1) | 21 (1.8) | 4 (0.3) | 8.4 |
| I | 1615 | 264 (16.3) | 30 (1.9) | 6 (0.4) | 0 (0) | 9.5 |
| J | 2091 | 917 (43.9) | 43 (2.1) | 92 (4.4) | 12 (0.6) | 8.7 |
| K | 1876 | 1055 (56.2) | 58 (3.1) | 124 (6.6) | 16 (0.9) | 8.2 |
| L | 3284 | 1739 (53) | 73 (2.2) | 144 (4.4) | 14 (0.4) | 8.4 |
| M | 3437 | 1510 (43.9) | 116 (3.4) | 132 (3.8) | 3 (0.1) | 8.6 |
| N | 3799 | 1708 (45) | 119 (3.1) | 130 (3.4) | 13 (0.3) | 8.6 |
| O | 647 | 292 (45.1) | 14 (2.2) | 14 (2.2) | 1 (0.2) | 8.8 |
| P | 1707 | 844 (49.4) | 74 (4.3) | 87 (5.1) | 16 (0.9) | 8.4 |
| Q | 2964 | 1435 (48.4) | 106 (3.6) | 137 (4.6) | 16 (0.5) | 8.5 |
| R | 3209 | 1235 (38.5) | 108 (3.4) | 99 (3.1) | 12 (0.4) | 8.8 |
| S | 2168 | 816 (37.6) | 52 (2.4) | 61 (2.8) | 8 (0.4) | 8.9 |
| T | 3834 | 1633 (42.6) | 119 (3.1) | 152 (4) | 23 (0.6) | 8.7 |
| U | 3935 | 1324 (33.6) | 127 (3.2) | 68 (1.7) | 9 (0.2) | 9.1 |
| V | 1936 | 1014 (52.4) | 114 (5.9) | 104 (5.4) | 17 (0.9) | 8.2 |
| W | 643 | 268 (41.7) | 33 (5.1) | 4 (0.6) | 0 (0) | 8.8 |
| X | 1469 | 680 (46.3) | 65 (4.4) | 73 (5) | 16 (1.1) | 8.5 |
| Y | 1313 | 551 (42) | 56 (4.3) | 39 (3) | 7 (0.5) | 8.7 |
| Total | 54,562 | 22,909 (42) | 1838 (3.4) | 1806 (3.3) | 248 (0.5) | 8.7 |
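Each percentage in the table above is the number of procedures with the finding divided by the number of procedures for that endoscopist. The short Python sketch below recomputes a few adenoma detection rates from counts copied from the table. The function name `detection_rate` and the choice of rows are illustrative only.

```python
def detection_rate(positive: int, procedures: int) -> float:
    """Detection rate as a percentage of procedures."""
    if procedures <= 0:
        raise ValueError("procedures must be positive")
    return 100 * positive / procedures


# (endoscopist, procedures, procedures with at least one adenoma),
# copied from the table above.
rows = [
    ("A", 3060, 1112),
    ("F", 1258, 338),
    ("I", 1615, 264),
    ("K", 1876, 1055),
]

for name, procedures, with_adenoma in rows:
    print(f"{name}: {detection_rate(with_adenoma, procedures):.1f}%")

# The "Total" row pools all procedures, so high-volume endoscopists
# carry more weight than they would in a simple mean of the 25 rates.
pooled = detection_rate(22909, 54562)
print(f"Total: {pooled:.1f}%")
```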
Proportion of patients assigned different surveillance intervals, sorted by endoscopists (N=25) with high, medium, and low adenoma detection rates and sessile serrated lesion detection rates.
| Surveillance interval | Adenoma detection rate <30%, n (%) | Adenoma detection rate 30%-45%, n (%) | Adenoma detection rate >45%, n (%) | Sessile serrated lesion detection rate <2%, n (%) | Sessile serrated lesion detection rate 2%-4%, n (%) | Sessile serrated lesion detection rate >4%, n (%) |
| 1 year | 0 (0) | 14 (0.04) | 13 (0.09) | 3 (0.02) | 8 (0.03) | 16 (0.1) |
| 3 years | 77 (2.68) | 1918 (5.07) | 894 (6.44) | 603 (4.36) | 1284 (5.19) | 1002 (6.26) |
| 3-5 years | 59 (2.05) | 2204 (5.83) | 1217 (8.77) | 545 (3.94) | 1557 (6.3) | 1378 (8.61) |
| 5-10 years | 25 (0.87) | 670 (1.77) | 389 (2.80) | 138 (1.00) | 491 (1.99) | 455 (2.84) |
| 7-10 years | 472 (16.43) | 11,213 (29.66) | 4953 (35.68) | 3527 (25.5) | 7508 (30.37) | 5603 (35.01) |
| 10 years | 2231 (77.75) | 21,740 (57.5) | 6397 (46.08) | 8988 (64.98) | 13,851 (56.02) | 7529 (47.04) |
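The table above groups endoscopists into three adenoma detection rate tiers. The Python sketch below shows one way to assign a tier and to express a cell as a share of the tier's procedures. Two points are assumptions rather than facts from the source: a rate of exactly 45% is placed in the middle tier, and the denominator for each percentage is taken to be the total number of procedures performed by the endoscopists in that tier. The function name `adr_tier` is hypothetical.

```python
def adr_tier(rate_percent: float) -> str:
    """Map an adenoma detection rate (in percent) to a tier label."""
    if rate_percent < 30:
        return "<30%"
    if rate_percent <= 45:  # boundary handling is an assumption
        return "30%-45%"
    return ">45%"


# (endoscopist, procedures, procedures with at least one adenoma) for a
# few endoscopists from the clinical application table.
endoscopists = [
    ("F", 1258, 338),
    ("I", 1615, 264),
    ("A", 3060, 1112),
    ("K", 1876, 1055),
]

procedures_by_tier = {}
for name, procedures, with_adenoma in endoscopists:
    tier = adr_tier(100 * with_adenoma / procedures)
    procedures_by_tier[tier] = procedures_by_tier.get(tier, 0) + procedures

# F and I are the only endoscopists below 30% in the source table, so
# this sum is the assumed denominator for the "<30%" column.
low_tier_procedures = procedures_by_tier["<30%"]

# Share of the low tier assigned a 7-10 year interval (472 patients).
share_7_to_10 = 100 * 472 / low_tier_procedures
print(low_tier_procedures, round(share_7_to_10, 2))
```

Under this assumption, the computed share for the 7-10 year interval in the lowest tier agrees with the 16.43% shown in the table.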