Liwei Wang1, Lei Luo2, Yanshan Wang1, Jason Wampfler1, Ping Yang1, Hongfang Liu3.
Abstract
BACKGROUND: Lung cancer is the second most common cancer for men and women; the wide adoption of electronic health records (EHRs) offers the potential to accelerate cohort-related epidemiological studies using informatics approaches. Since manual extraction from large volumes of text is time-consuming and labor-intensive, some efforts have emerged to automatically extract information from text for lung cancer patients using natural language processing (NLP), an artificial intelligence technique.
Keywords: Histology; Lung cancer; Natural language processing; Stage; Treatments; Tumor grade
Year: 2019 PMID: 31801515 PMCID: PMC6894100 DOI: 10.1186/s12911-019-0931-8
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Study rationale
Fig. 2 Study design. EHR: Electronic Health Record, RS: related sentences, DL: deep learning
Data elements contained in each data source
| Data Elements | Clinical Notes | Pathology Reports | Surgery Reports | Existing Dataset |
|---|---|---|---|---|
| Stage | ✔ | ✔ | ✔ | ✔ |
| Histology | ✔ | ✔ | ✔ | ✔ |
| Tumor Grade | ✔ | ✔ | ✔ | ✔ |
| Chemotherapy | ✔ | ✔ | ✔ | |
| Radiotherapy | ✔ | ✔ | ✔ | |
| Surgery | ✔ | ✔ | ✔ | |
Normalized histological types and sub-types in the NLP system
| Histological types | Sub-types |
|---|---|
| Small cell | Small cell |
| Non-small cell | Adenocarcinoma; Squamous; Large / larger neuroendocrine; Adenosquamous; Carcinoid; Carcinoid (typical / atypical); Non-small cell (NSCLC unspecified); Other NSCLC; Other cell type / Unknown |
Normalized stages and tumor grade in the NLP system
| Standardized Stages | Standardized Tumor Grades |
|---|---|
| Ia | Well differentiated |
| Ib | Moderately differentiated |
| IIa | Poorly differentiated |
| IIb | Undifferentiated |
| IIIa | |
| IIIb | |
| IV | |
| Early stage | |
| Late stage | |
| Extensive (SCLC) | |
| Limited (SCLC) | |
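To illustrate what normalizing to the standardized stages above could look like, here is a minimal sketch. The regular expression, the `normalize_stage` function, and the sample phrases are hypothetical and are not taken from the NLP system described here; the sketch only covers the Ia to IV values and ignores "Early stage", "Late stage", and the SCLC-specific labels.

```python
import re

# Subset of the standardized stage labels from the table above.
STANDARD_STAGES = {"Ia", "Ib", "IIa", "IIb", "IIIa", "IIIb", "IV"}

# Hypothetical pattern: the word "stage", a Roman numeral I-IV, and an optional A/B suffix.
_STAGE_RE = re.compile(r"\bstage\s+(IV|I{1,3})\s*([AB])?\b", re.IGNORECASE)


def normalize_stage(text):
    """Return a standardized stage label found in text, or None if no label matches."""
    match = _STAGE_RE.search(text)
    if match is None:
        return None
    numeral = match.group(1).upper()
    letter = (match.group(2) or "").lower()
    candidate = numeral + letter
    # Only return values that appear in the standardized list.
    return candidate if candidate in STANDARD_STAGES else None


print(normalize_stage("Stage IIIA adenocarcinoma of the lung"))
print(normalize_stage("clinical stage IV disease"))
```

A mention such as "stage I" with no sub-stage letter returns `None` in this sketch, because the standardized list above has no bare "I" entry.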
Fig. 3 Architecture overview of the CNN model
Comparison of source coverage
| Sources | Time window | Coverage |
|---|---|---|
| Existing dataset | | 2311 |
| Clinical notes | | 2307 |
| Pathology reports | Between 14 days before and 30 days after lung cancer diagnosis | 1660 |
| | Between 14 days before and 60 days after lung cancer diagnosis | 1835 |
| | Between 14 days before and 90 days after lung cancer diagnosis | 1896 |
| Surgery reports | Between 14 days before and 30 days after lung cancer diagnosis | 938 |
| | Between 14 days before and 60 days after lung cancer diagnosis | 1002 |
| | Between 14 days before and 90 days after lung cancer diagnosis | 1023 |
| | Between 14 days before and 365 days after lung cancer diagnosis | 1130 |
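The time windows in the table above can be expressed as a simple date filter. The following is a minimal sketch, assuming each report carries a patient identifier and a report date, and that both window boundaries are inclusive; the function names and the sample records are hypothetical.

```python
from datetime import date, timedelta


def in_window(report_date, diagnosis_date, days_after, days_before=14):
    """True if the report falls between days_before before and days_after after diagnosis."""
    start = diagnosis_date - timedelta(days=days_before)
    end = diagnosis_date + timedelta(days=days_after)
    return start <= report_date <= end  # inclusive bounds are an assumption


def covered_patients(reports, diagnoses, days_after):
    """Return the set of patient IDs with at least one report inside the window.

    reports: iterable of (patient_id, report_date)
    diagnoses: dict mapping patient_id to diagnosis date
    """
    return {
        pid
        for pid, report_date in reports
        if pid in diagnoses and in_window(report_date, diagnoses[pid], days_after)
    }


# Hypothetical sample data, for illustration only.
diagnoses = {"p1": date(2015, 1, 15), "p2": date(2015, 3, 1)}
reports = [
    ("p1", date(2015, 1, 5)),   # 10 days before diagnosis
    ("p2", date(2015, 4, 15)),  # 45 days after diagnosis
    ("p2", date(2014, 12, 1)),  # 90 days before diagnosis
]

for window in (30, 60, 90):
    print(window, len(covered_patients(reports, diagnoses, window)))
```

Widening the window can only add patients, which matches the monotonic increase in coverage shown in the table.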
Precision and recall for all data elements using the NLP system
| Data elements | Number of patients in existing dataset (A) | Number of patients with true NLP results (B) | Number of patients with NLP results (C) | Precision1 (B/A) | Precision2 (B/C) | Recall | Time window |
|---|---|---|---|---|---|---|---|
| Stage | 2127 | 1330 | 1883 | 0.625 | 0.706 | 0.885 | 90 days |
| | 2127 | 1328 | 1883 | 0.624 | 0.705 | 0.885 | 60 days |
| | 2127 | 1325 | 1883 | 0.623 | 0.704 | 0.885 | 30 days |
| Histology | 2208 | 1918 | 1989 | 0.869 | 0.885 | 0.982 | 90 days |
| | 2208 | 1914 | 2164 | 0.867 | 0.884 | 0.980 | 60 days |
| | 2208 | 1889 | 2154 | 0.856 | 0.877 | 0.976 | 30 days |
| Tumor grade | 1635 | 1182 | 1203 | 0.723 | 0.902 | 0.801 | 90 days |
| | 1635 | 1170 | 1300 | 0.716 | 0.900 | 0.795 | 60 days |
| | 1635 | 1143 | 1274 | 0.700 | 0.897 | 0.779 | 30 days |
| Chemotherapy | 1674 | 1674 | 1674 | 1 | 1 | 1 | 365 days |
| Radiotherapy | 769 | 769 | 769 | 1 | 1 | 1 | 365 days |
| Surgery | 312 | 312 | 312 | 1 | 1 | 1 | 365 days |
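The ratios in the table above can be recomputed from the three counts. The sketch below uses the Stage row for the 90-day window. Precision1 (B/A) and Precision2 (B/C) follow the column headers; the table does not state the recall formula, so computing recall as C/A is an assumption, inferred from the fact that it reproduces the Stage values. The function name `nlp_metrics` is hypothetical.

```python
def nlp_metrics(a, b, c):
    """Compute the table's ratios from counts A, B and C.

    a: patients in the existing dataset (A)
    b: patients with true NLP results (B)
    c: patients with NLP results (C)
    """
    return {
        "precision1": b / a,  # B/A, as labeled in the table header
        "precision2": b / c,  # B/C, as labeled in the table header
        "recall": c / a,      # assumption: C/A, not stated in the table
    }


# Stage, 90-day window: A = 2127, B = 1330, C = 1883.
stage_90 = nlp_metrics(2127, 1330, 1883)
for name, value in stage_90.items():
    print(f"{name}: {value:.3f}")
# precision1: 0.625
# precision2: 0.706
# recall: 0.885
```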
Fig. 4 Comparison of recalls using the NLP system combining all longitudinal clinical notes, pathology reports and surgery reports of various time windows. 30, 60 or 90 days refers to using pathology reports and surgery reports between 14 days before and 30, 60 or 90 days after lung cancer diagnosis
Fig. 5 Comparison of precision1 and precision2 using the NLP system combining all longitudinal clinical notes, pathology reports and surgery reports of various time windows. 30, 60 or 90 days refers to using pathology reports and surgery reports between 14 days before and 30, 60 or 90 days after lung cancer diagnosis
Number of each histological cell type in training and testing data
| Histological types | Number (%) in training data set | Number (%) in testing data set |
|---|---|---|
| Adenocarcinoma | 897 (44.7%) | 37 (37%) |
| Adenosquamous | 16 (0.8%) | 2 (2%) |
| Carcinoid | 1 (0.05%) | 0 |
| Carcinoid typical / atypical | 15 (0.75%) | 1 (1%) |
| Large / larger neuroendocrine | 23 (1.1%) | 1 (1%) |
| Non-small cell | 342 (17.0%) | 15 (15%) |
| Other cell type / Unknown | 1 (0.05%) | 0 |
| Other NSCLC | 14 (0.70%) | 1 (1%) |
| Small cell | 339 (16.9%) | 21 (21%) |
| Squamous | 358 (17.8%) | 22 (22%) |
Error analysis
| Error types | Reason | Number |
|---|---|---|
| Failure of identifying subtypes | With no related information | 2 |
| | With related information but ignored by algorithm | 2 |
| Failure of identifying the type in reference standard | Mistake of reference standard | 1 |