Abdulrahman K AAlAbdulsalam, Jennifer H Garvin, Andrew Redd, Marjorie E Carter, Carol Sweeny, Stephane M Meystre.
Abstract
Cancer stage is one of the most important prognostic parameters in most cancer subtypes. The American Joint Committee on Cancer (AJCC) specifies criteria for staging each cancer type based on tumor characteristics (T), lymph node involvement (N), and tumor metastasis (M), known as the TNM staging system. Information related to cancer stage is typically recorded in clinical narrative text notes and other informal means of communication in the Electronic Health Record (EHR). As a result, human chart abstractors (known as certified tumor registrars) have to search through voluminous amounts of text to extract accurate stage information and resolve discordance between different data sources. This study proposes novel applications of natural language processing and machine learning to automatically extract and classify TNM stage mentions from records at the Utah Cancer Registry. Our results indicate that TNM stages can be extracted and classified automatically with high accuracy (extraction sensitivity: 95.5%-98.4% and classification sensitivity: 83.5%-87%).
Year: 2018 PMID: 29888032 PMCID: PMC5961766
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
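To make the extraction task in the abstract concrete, here is a minimal sketch of regular-expression matching for TNM mentions. It is not the study's REGEX system: the patterns, the `extract_tnm` function, and the accepted stage values are all assumptions chosen for illustration, and real registry text would need a much broader pattern set.

```python
import re

# Hypothetical component pattern (an assumption, not the authors' rules):
# T with is/X/0-4, N with X/0-3, M with X/0-1, each with an optional letter suffix.
COMPONENT = r"(?:T(?:is|X|[0-4][a-d]?)|N(?:X|[0-3][a-c]?)|M(?:X|[01][a-c]?))"

# A token is an optional c/p prefix (clinical/pathological) followed by one
# or more components written together, e.g. "pT2N0MX".
TOKEN = re.compile(rf"\b([cp]?)((?:{COMPONENT})+)\b", re.IGNORECASE)
PART = re.compile(COMPONENT, re.IGNORECASE)


def extract_tnm(text):
    """Return (prefix, category, value) tuples for each TNM component found."""
    mentions = []
    for token in TOKEN.finditer(text):
        prefix = token.group(1).lower()
        # Split a run such as "T2N0MX" into its individual components.
        for part in PART.finditer(token.group(2)):
            mention = part.group(0)
            mentions.append((prefix, mention[0].upper(), mention[1:]))
    return mentions


print(extract_tnm("PATH: pT2N0MX"))
print(extract_tnm("DISCUSSED PALLIATIVE TX W/ CARBO/TAXL"))
```

The second call returns a `T`/`X` mention even though "TX" there abbreviates treatment, which shows why pattern matching alone is not enough to separate stage mentions from look-alike abbreviations.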
Document types and counts for the corpus used in this study.
| Record Type | QCSET (n=60) | ABSTRACTION (n=240) | TOTAL (n=300) |
|---|---|---|---|
| NAACCR | 72 | 286 | 358 |
| E-path | 113 | 339 | 452 |
| Total | 185 | 625 | 810 |
Inter-annotator agreement by document type in the QCSET. Agreement was measured with Cohen's kappa for two raters.
| Document Type | Mentions annotated by both raters | Kappa | p-value |
|---|---|---|---|
| e-path | 60 | 0.658 | < 0.001 |
| NAACCR abstract | 125 | 0.9009 | < 0.001 |
| All | 185 | 0.8129 | < 0.001 |
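For readers unfamiliar with the agreement statistic in the table above, the following sketch computes Cohen's kappa for two raters from paired labels. The function name `cohens_kappa` and the toy labels are illustrative assumptions; the record does not state which labels the annotators' mentions were compared on.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example (not study data): the raters disagree on one of six mentions.
rater_1 = ["T", "T", "N", "N", "M", "M"]
rater_2 = ["T", "T", "N", "M", "M", "M"]
print(round(cohens_kappa(rater_1, rater_2), 3))
```

Kappa discounts the agreement expected by chance, so it is lower than raw percent agreement whenever the label distribution makes coincidental matches likely.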
TNM mentions extracted by annotators from corpus used as reference standard.
| Data Subset | Count of NAACCR and e-path Records | T | N | M | Total TNM annotations |
|---|---|---|---|---|---|
| Train (50%) | 405 | 235 | 192 | 86 | 513 |
| Development (17%) | 135 | 85 | 73 | 27 | 185 |
| Test (33%) | 270 | 139 | 119 | 52 | 310 |
| All | 810 | 459 | 384 | 165 | 1008 |
Figure 1: NLP and ML application high-level architecture.
TNM mentions extraction results.
| Evaluation Method | System | Development Set Precision | Development Set Recall | Development Set F1-measure | Test Set Precision | Test Set Recall | Test Set F1-measure |
|---|---|---|---|---|---|---|---|
| Strict match | REGEX | 0.926 | 0.946 | 0.936 | 0.890 | 0.884 | 0.887 |
| | CRF | 0.952 | 0.859 | 0.903 | 0.923 | 0.845 | 0.882 |
| Partial match | REGEX | 0.958 | 0.961 | ||||
| CRF | 0.897 | 0.940 | 0.906 | 0.946 | |||
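The table above distinguishes strict from partial matching. The sketch below shows one common way to score extracted spans under both criteria. It assumes that a strict match means identical character offsets and that a partial match means any character overlap; the record does not define either criterion, so these definitions, the `score` function, and the toy spans are illustrative only.

```python
def overlaps(span_a, span_b):
    """True if two half-open (start, end) character spans share any character."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


def score(gold, predicted, partial=False):
    """Precision, recall, and F1 for predicted spans against reference spans."""
    match = overlaps if partial else (lambda a, b: a == b)
    # A prediction is correct if it matches some reference span;
    # a reference span is found if some prediction matches it.
    correct_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    found_gold = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = correct_pred / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy spans (not study data): one exact hit, one off-by-one boundary, one miss.
gold = [(0, 7), (20, 22), (40, 43)]
predicted = [(0, 7), (19, 22), (60, 62)]
print("strict: ", score(gold, predicted))
print("partial:", score(gold, predicted, partial=True))
```

The off-by-one span counts as an error under strict matching and as a hit under partial matching, which is why partial-match scores are never lower than strict-match scores for the same output.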
Pathological and clinical TNM classification results.
| Evaluation Set | System | Pathological Prec. | Pathological Recall | Pathological F1 | Clinical Prec. | Clinical Recall | Clinical F1 | Overall Prec. | Overall Recall | Overall F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Development | REGEX | 0.952 | 0.688 | 0.798 | 1.000 | 0.025 | 0.049 | 0.529 | 0.543 | 0.536 |
| | REGEX-CRF | 0.901 | 0.889 | 0.895 | 0.681 | 0.800 | 0.736 | 0.847 | 0.870 | 0.858 |
| Test | REGEX | 0.934 | 0.536 | 0.681 | 1.000 | 0.051 | 0.097 | 0.386 | 0.384 | 0.385 |
| | REGEX-CRF | 0.859 | 0.896 | 0.877 | 0.793 | 0.704 | 0.746 | 0.841 | 0.835 | 0.838 |
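As a quick arithmetic check on the classification table, the sketch below recomputes F1 as the harmonic mean of precision and recall for the REGEX rows, whose nine values per row are complete in the table. The helper `f1_score` and the dictionary layout are illustrative assumptions; the numbers are copied from the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the REGEX rows of the table.
REGEX_ROWS = {
    "development, pathological": (0.952, 0.688, 0.798),
    "development, clinical": (1.000, 0.025, 0.049),
    "development, overall": (0.529, 0.543, 0.536),
    "test, pathological": (0.934, 0.536, 0.681),
    "test, clinical": (1.000, 0.051, 0.097),
    "test, overall": (0.386, 0.384, 0.385),
}

for name, (precision, recall, reported) in REGEX_ROWS.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed {computed:.3f}, reported {reported:.3f}")
```

The clinical rows illustrate how the harmonic mean penalizes imbalance: perfect precision combined with recall near 0.03 to 0.05 still yields an F1 below 0.1.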
Example statements containing the ‘TX’ abbreviation.
| Statements with TX abbreviations |
|---|
| DISCUSSED PALLIATIVE TX W/ CARBO/TAXL … |
| NEW LUNG CANCER F/U & TX … |
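The statements above show "TX" used for treatment rather than for an unassessable primary tumor. One way to illustrate the ambiguity is a context heuristic that accepts "TX" as a stage only when an N or M component follows it directly. This rule, the pattern names, and `classify_tx` are hypothetical; the record does not describe how the study's systems resolve this case.

```python
import re

# Hypothetical heuristic: "TX" counts as a stage only when an N or M
# component follows immediately, as in "TX N0 M0" or "pTXN1".
STAGE_CONTEXT = re.compile(r"\b[cp]?TX\s*,?\s*[cp]?[NM](?:X|[0-3])", re.IGNORECASE)
STANDALONE_TX = re.compile(r"\bTX\b", re.IGNORECASE)


def classify_tx(statement):
    """Label a statement's use of TX as 'stage', 'treatment', or 'none'."""
    if STAGE_CONTEXT.search(statement):
        return "stage"
    if STANDALONE_TX.search(statement):
        return "treatment"
    return "none"


for statement in (
    "DISCUSSED PALLIATIVE TX W/ CARBO/TAXL",
    "NEW LUNG CANCER F/U & TX",
    "STAGING: TX N0 M0",
):
    print(statement, "->", classify_tx(statement))
```

A fixed rule like this will miss stage uses of "TX" that appear without neighboring N or M values, which is the kind of case where a learned sequence model that sees the surrounding words may help.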
Count of TNM mentions extracted from a selected set of cases.
| Site | TNM | Count |
|---|---|---|
| Colon | T | 2814 |
| | N | 2635 |
| | M | 1012 |
| Lung | T | 1615 |
| | N | 1407 |
| | M | 634 |
| Prostate | T | 2341 |
| | N | 1409 |
| | M | 693 |
| Total | | 14560 |
Figure 2: Frequency of TNM stage mentions extracted per patient.
| NAACCR Item Number | Text Field Name |
|---|---|
| 2520 | Text–Dx Proc–PE |
| 2530 | Text–DX Proc–X-ray/scan |
| 2540 | Text–DX Proc–Scopes |
| 2550 | Text–DX Proc–Lab Tests |
| 2560 | Text–DX Proc–Op |
| 2570 | Text–DX Proc–Path |
| 2580 | Text–Primary Site Title |
| 2590 | Text–Histology Title |
| 2600 | Text–Staging |
| 2610 | RX Text–Surgery |
| 2620 | RX Text–Radiation (Beam) |
| 2630 | RX Text–Radiation Other |
| 2640 | RX Text–Chemo |
| 2650 | RX Text–Hormone |
| 2660 | RX Text–BRM |
| 2670 | RX Text–Other |
| 2680 | RX Text–Remarks |
| 2690 | Text–Place of Diagnosis |