Gyuseon Song, Su Jin Chung, Ji Yeon Seo, Sun Young Yang, Eun Hyo Jin, Goh Eun Chung, Sung Ryul Shim, Soonok Sa, Moongi Simon Hong, Kang Hyun Kim, Eunchan Jang, Chae Won Lee, Jung Ho Bae, Hyun Wook Han.
Abstract
Background and Aims: The utility of clinical information from esophagogastroduodenoscopy (EGD) reports has been limited because of their unstructured narrative format. We developed a natural language processing (NLP) pipeline that automatically extracts information about gastric diseases from unstructured EGD reports and demonstrated its applicability in clinical research.
Keywords: digestive system; endoscopy; gastritis; information extraction; natural language processing
Year: 2022 PMID: 35683353 PMCID: PMC9181010 DOI: 10.3390/jcm11112967
Source DB: PubMed Journal: J Clin Med ISSN: 2077-0383 Impact factor: 4.964
Figure 1. Data flow chart of the study. EGD: esophagogastroduodenoscopy.
Figure 2. Process of information extraction using the NLP pipeline from EGD and pathology reports. NLP: natural language processing; EGD: esophagogastroduodenoscopy.
Figure 3. Example of extraction and summarization process of the NLP pipeline. Extent 1: antrum only; Extent 2: body/fundus only; Extent 3: antrum and body/fundus; NLP: natural language processing.
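The extent coding in the Figure 3 caption can be illustrated with a small sketch. It is not the authors' pipeline: the function name `extent_code`, the lowercase location labels, and the choice to return `None` for unspecified locations are assumptions made here for illustration.

```python
from typing import Iterable, Optional

# Hypothetical location labels; the source only names antrum, body, and fundus.
ANTRUM = "antrum"
PROXIMAL = {"body", "fundus"}


def extent_code(locations: Iterable[str]) -> Optional[int]:
    """Map the stomach regions mentioned for one gastritis type to an extent code.

    1 = antrum only, 2 = body/fundus only, 3 = antrum and body/fundus.
    Returns None when no recognised location is present (an assumption made
    here for unspecified locations, not a rule stated in the source).
    """
    found = {loc.strip().lower() for loc in locations}
    has_antrum = ANTRUM in found
    has_proximal = bool(found & PROXIMAL)
    if has_antrum and has_proximal:
        return 3
    if has_proximal:
        return 2
    if has_antrum:
        return 1
    return None
```

For example, `extent_code(["antrum", "body"])` yields extent 3, while `extent_code(["fundus"])` yields extent 2.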
Demographics of the train dataset (n = 2000) and test dataset (n = 1000).
| Variables | Train Dataset | Test Dataset | p-Value |
|---|---|---|---|
| Age, n (%) | | | 0.548 |
| <30 | 48 (2.4) | 19 (1.9) | |
| 30–49 | 761 (38.0) | 395 (39.5) | |
| 50–69 | 1091 (54.6) | 529 (52.9) | |
| ≥70 | 100 (5.0) | 57 (5.7) | |
| Sex, n (%) | | | 0.643 |
| Male | 1229 (61.4) | 605 (60.5) | |
| Female | 771 (38.6) | 395 (39.5) | |
| Chronic gastritis, n (%) | | | |
| Atrophic gastritis | 880 (44.0) | 415 (41.5) | 0.206 |
| Intestinal metaplasia | 491 (24.6) | 223 (22.3) | 0.187 |
| Superficial gastritis | 225 (11.2) | 97 (9.7) | 0.228 |
| Erosive gastritis | 1195 (59.8) | 597 (59.7) | 0.989 |
| Follicular gastritis | 5 (0.2) | 5 (0.5) | 0.433 |
| Other gastric diseases, n (%) | | | |
| Ulcer | 163 (8.2) | 86 (8.6) | 0.726 |
| Polyp | 404 (20.2) | 224 (22.4) | 0.177 |
| SMT | 80 (4.0) | 36 (3.6) | 0.663 |
| Dysplasia * | 22 (1.1) | 8 (0.8) | 0.559 |
| Cancer † | 9 (0.4) | 8 (0.8) | 0.344 |
* Dysplasia includes tubular adenoma with low and high-grade dysplasia. † Cancer includes carcinoma, neuroendocrine tumor, and lymphoma with mucosa-associated lymphoid tissue. SMT: submucosal tumor.
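The record does not say which statistical test produced the train-versus-test comparison values in the demographics table. As a rough sketch, assuming a Pearson chi-square test with Yates continuity correction on a 2×2 table (an assumption, not the authors' stated method), the sex comparison could be reproduced with the standard library alone. The function name `chi2_yates_2x2` is hypothetical.

```python
import math


def chi2_yates_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square with Yates continuity correction for [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Yates correction: shrink |ad - bc| by n/2, never below zero.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    stat = n * diff**2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Sex row: train 1229 male / 771 female, test 605 male / 395 female.
stat, p = chi2_yates_2x2(1229, 771, 605, 395)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

Under this assumed test, the sex row gives a value close to the 0.643 listed in the table, which suggests (but does not confirm) that a continuity-corrected chi-square test was used.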
Performance of information extraction for gastritis using the NLP pipeline.
| Variables | Sensitivity | PPV | Accuracy | F1-Score |
|---|---|---|---|---|
| Atrophic Gastritis | | | | |
| Presence | 0.993 | 1.000 | 0.997 | 0.996 |
| Extent * | | | | |
| 1 | 0.952 | 1.000 | 0.993 | 0.945 |
| 2 | 1.000 | 0.800 | 0.999 | 0.889 |
| 3 | 0.992 | 0.988 | 0.995 | 0.990 |
| Intestinal Metaplasia | | | | |
| Presence | 1.000 | 1.000 | 1.000 | 1.000 |
| Extent * | | | | |
| 1 | 0.959 | 0.973 | 0.995 | 0.966 |
| 2 | 1.000 | 1.000 | 1.000 | 1.000 |
| 3 | 0.985 | 0.978 | 0.995 | 0.981 |
| Superficial Gastritis | | | | |
| Presence | 0.979 | 0.990 | 0.997 | 0.984 |
| Extent * | | | | |
| 1 | 0.984 | 1.000 | 0.999 | 0.992 |
| 2 | 0.885 | 1.000 | 0.997 | 0.939 |
| 3 | 1.000 | 0.900 | 0.999 | 0.974 |
| Erosive Gastritis | | | | |
| Presence | 0.990 | 0.998 | 0.993 | 0.994 |
| Extent * | | | | |
| 1 | 0.963 | 1.000 | 0.984 | 0.981 |
| 2 | 0.929 | 0.988 | 0.993 | 0.958 |
| 3 | 0.986 | 0.862 | 0.988 | 0.920 |
| Follicular Gastritis | | | | |
| Presence | 1.000 | 1.000 | 1.000 | 1.000 |
| Extent * | | | | |
| 1 | 0.750 | 1.000 | 0.999 | 0.857 |
| 2 | N/A | N/A | 1.000 | N/A |
| 3 | 1.000 | 1.000 | 1.000 | 1.000 |
| Overall | 0.966 | 0.972 | 0.996 | 0.967 |
* Extent 1 means gastritis at the antrum only, extent 2 means gastritis at the body or fundus only, and extent 3 means gastritis at both the antrum and the body or fundus. NLP: natural language processing. PPV: positive predictive value. N/A: not available.
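The record does not spell out how the F1-score column is derived. Under the standard definition, F1 is the harmonic mean of sensitivity (recall) and PPV (precision). The sketch below applies that definition to one row of the table. The function name `f1_score` is chosen here for illustration.

```python
def f1_score(sensitivity: float, ppv: float) -> float:
    """Harmonic mean of sensitivity (recall) and PPV (precision)."""
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# Atrophic gastritis, extent 2: sensitivity 1.000, PPV 0.800.
print(round(f1_score(1.000, 0.800), 3))  # 0.889
```

Rows marked N/A for sensitivity and PPV have no F1-score either, because the harmonic mean needs both inputs.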
Performance of information extraction for gastric ulcers, polypoid lesions, and neoplastic diseases using the NLP pipeline.
| Variables | Sensitivity | PPV | Accuracy | F1-Score |
|---|---|---|---|---|
| Ulcer | | | | |
| Presence | 0.988 | 0.977 | 0.997 | 0.983 |
| Location | | | | |
| Antrum | 0.956 | 0.985 | 0.996 | 0.970 |
| Body | 1.000 | 0.952 | 0.999 | 0.976 |
| Fundus | 1.000 | 1.000 | 1.000 | 1.000 |
| Stages | | | | |
| Active | 1.000 | 1.000 | 1.000 | 1.000 |
| Healing | 1.000 | 1.000 | 1.000 | 1.000 |
| Scar | 0.977 | 0.956 | 0.997 | 0.966 |
| Size | N/A | N/A | 0.999 | N/A |
| Polyp | | | | |
| Presence | 0.991 | 1.000 | 0.998 | 0.996 |
| Location | | | | |
| Antrum | 1.000 | 0.964 | 0.997 | 0.982 |
| Body | 0.991 | 0.973 | 0.996 | 0.982 |
| Fundus | 0.983 | 1.000 | 0.999 | 0.991 |
| Size | N/A | N/A | 1.000 | N/A |
| SMT | | | | |
| Presence | 0.972 | 0.972 | 0.998 | 0.972 |
| Location | | | | |
| Antrum | 0.750 | 0.818 | 0.995 | 0.783 |
| Body | 1.000 | 1.000 | 1.000 | 1.000 |
| Fundus | 0.833 | 1.000 | 0.998 | 0.909 |
| Size | N/A | N/A | 0.999 | N/A |
| Dysplasia * | | | | |
| Presence | 1.000 | 1.000 | 1.000 | 1.000 |
| Location | | | | |
| Antrum | 1.000 | 1.000 | 1.000 | 1.000 |
| Body | 1.000 | 1.000 | 1.000 | 1.000 |
| Fundus | N/A | N/A | 1.000 | N/A |
| Size | N/A | N/A | 0.999 | N/A |
| Cancer † | | | | |
| Presence | 1.000 | 1.000 | 1.000 | 1.000 |
| Location | | | | |
| Antrum | 1.000 | 1.000 | 1.000 | 1.000 |
| Body | 1.000 | 1.000 | 1.000 | 1.000 |
| Fundus | 1.000 | 1.000 | 1.000 | 1.000 |
| Size | N/A | N/A | 1.000 | N/A |
| Overall | 0.975 | 0.982 | 0.999 | 0.978 |
* Dysplasia includes tubular adenoma with low and high-grade dysplasia. † Cancer includes carcinoma, neuroendocrine tumor, and lymphoma with mucosa-associated lymphoid tissue. NLP: natural language processing. PPV: positive predictive value. SMT: submucosal tumor. N/A: not available.
Prevalence of gastritis, gastric ulcers, polypoid lesions, and neoplastic diseases by sex and age between 2010 and 2019 (n = 248,966).
| | Gastritis | | | | | Gastric Ulcer, Polypoid Lesions, and Neoplastic Diseases | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Variables | Atrophic Gastritis | Intestinal Metaplasia | Superficial Gastritis | Erosive Gastritis | Follicular Gastritis | Ulcer | Polyp | SMT | Dysplasia * | Cancer † |
| Sex, n | | | | | | | | | | |
| Male | 68,719 | 29,081 | 19,906 | 47,792 | 308 | 3002 | 6050 | 3050 | 324 | 230 |
| Female | 41,517 | 12,241 | 19,362 | 29,760 | 696 | 1076 | 8361 | 3064 | 110 | 95 |
| Age group, n | | | | | | | | | | |
| 18–19 | 2 | 1 | 27 | 45 | 3 | 0 | 5 | 4 | 0 | 0 |
| 20–29 | 276 | 38 | 1179 | 1425 | 121 | 38 | 320 | 74 | 0 | 4 |
| 30–39 | 3455 | 619 | 5904 | 7440 | 358 | 224 | 1874 | 366 | 4 | 11 |
| 40–49 | 21,996 | 5999 | 13,164 | 21,012 | 314 | 794 | 4329 | 1215 | 42 | 52 |
| 50–59 | 46,125 | 16,735 | 13,134 | 29,084 | 176 | 1572 | 4626 | 2243 | 152 | 119 |
| 60–69 | 27,937 | 12,386 | 4766 | 14,064 | 31 | 996 | 2366 | 1560 | 155 | 88 |
| 70–79 | 9519 | 4998 | 996 | 4093 | 1 | 392 | 798 | 586 | 63 | 49 |
| ≥80 | 926 | 546 | 98 | 389 | 0 | 62 | 93 | 66 | 18 | 2 |
* Dysplasia includes tubular adenoma with low and high-grade dysplasia. † Cancer includes carcinoma, neuroendocrine tumor, and lymphoma with mucosa-associated lymphoid tissue. SMT: submucosal tumor.
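A quick consistency check on the prevalence table is to confirm that the counts by sex and by age group add up to the same total for each disease. The sketch below does this for two columns and also expresses each total as a share of the n = 248,966 given in the caption. Treating that n as the denominator is an assumption made here, and the names `BY_SEX`, `BY_AGE`, and `totals_and_share` are hypothetical.

```python
# Counts copied from the prevalence table; only two diseases are shown for brevity.
BY_SEX = {
    "Atrophic gastritis": {"Male": 68_719, "Female": 41_517},
    "Cancer": {"Male": 230, "Female": 95},
}
BY_AGE = {
    "Atrophic gastritis": [2, 276, 3_455, 21_996, 46_125, 27_937, 9_519, 926],
    "Cancer": [0, 4, 11, 52, 119, 88, 49, 2],
}
N_TOTAL = 248_966  # n stated in the table caption (assumed denominator)


def totals_and_share(disease: str) -> tuple[int, float]:
    """Return the total count and its percentage of N_TOTAL for one disease."""
    total_by_sex = sum(BY_SEX[disease].values())
    total_by_age = sum(BY_AGE[disease])
    # The sex and age breakdowns should describe the same cases.
    if total_by_sex != total_by_age:
        raise ValueError(
            f"{disease}: sex total {total_by_sex} != age total {total_by_age}"
        )
    return total_by_sex, 100 * total_by_sex / N_TOTAL


for name in BY_SEX:
    total, pct = totals_and_share(name)
    print(f"{name}: {total} ({pct:.1f}%)")
```

For both diseases shown, the sex and age breakdowns sum to the same total, so the function returns without raising.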
Figure 4. Extents of gastritis (a) and locations of other gastric diseases (b) between 2010 and 2019 (n = 248,966). The percentages for a disease may sum to less than 100% because some reports do not specify a location, or to more than 100% because a disease can occur at multiple locations in the stomach. Dysplasia includes tubular adenoma with low and high-grade dysplasia. Cancer includes carcinoma, neuroendocrine tumor, and lymphoma with mucosa-associated lymphoid tissue. Extent 1: antrum only; Extent 2: body/fundus only; Extent 3: antrum and body/fundus; SMT: submucosal tumor.