Ergin Soysal, Jeremy L Warner, Joshua C Denny, Hua Xu.
Abstract
Metastatic patterns of spread at the time of cancer recurrence are among the most important prognostic factors for estimating a patient's clinical course and survival. This information is not easily accessible because it is rarely recorded in a structured format. This paper describes a system that categorizes pathology reports by specimen site and detects metastatic status within the report. A clinical NLP pipeline was developed using sentence boundary detection, tokenization, section identification, part-of-speech tagging, and chunking, together with rule-based methods, to extract metastasis site and status in combination with five types of information related to tumor metastases: histological type, grade, specimen site, metastatic status indicators, and procedure. The system achieved a recall of 0.84 and a precision of 0.88 for metastatic status detection, and a recall of 0.89 and a precision of 0.93 for metastasis site detection. This study demonstrates the feasibility of applying NLP technologies to extract valuable metastasis information from pathology reports, and we believe it will greatly benefit studies on cancer metastases that use EHRs.
Year: 2017 PMID: 28815141 PMCID: PMC5543353
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
Figure 1. Annotation of pathology reports using conMarker. Each document was annotated for specimen site, procedure, histologic type, grade, and metastatic status indicators. The reports were also tagged at the document level for the existence of metastatic cancer and for metastasis sites, if any.
Figure 2. Entities and relationships for a specimen.
Figure 3. System architecture. After section detection, the rest of the process is carried out on the diagnosis section. (POS: part-of-speech, LVG: Lexical Variant Generation, SCT: SNOMED CT, UMLS: Unified Medical Language System)
Generalizations of metastasis sites based on SNOMED CT.
| Label | Generic Class | SCTID |
| PUL | Pulmonary structure | 363536003 |
| OSS | Musculoskeletal system | 26107004 |
| HEP | Liver and/or biliary structure | 303270005 |
| BRA | Intracranial structure | 128319008 |
| MAR | Bone marrow structure | 14016003 |
| PLE | Pleural sac structure | 116006008 |
| PER | Peritoneal sac structure | 118762006 |
| ADR | Adrenal structure | 23451007 |
| SKI | Skin and/or surface epithelium; Skin and subcutaneous tissue structure | 400199006; 127856007 |
| LYM | Lymphoid system structure | 122490001 |
| OTH | Body structure, but not included above | 123037004 |
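The table above can be read as a lookup from a generic SNOMED CT class to a short label. The following is a minimal sketch of such a lookup, not the authors' implementation. It assumes the caller already has the set of SNOMED CT ancestor concept IDs for a site concept (how those ancestors are obtained is outside this sketch), and the names `SITE_CLASSES` and `labels_for_ancestors` are hypothetical.

```python
# Label -> SNOMED CT concept IDs, transcribed from the table above.
SITE_CLASSES = {
    "PUL": ["363536003"],
    "OSS": ["26107004"],
    "HEP": ["303270005"],
    "BRA": ["128319008"],
    "MAR": ["14016003"],
    "PLE": ["116006008"],
    "PER": ["118762006"],
    "ADR": ["23451007"],
    "SKI": ["400199006", "127856007"],
    "LYM": ["122490001"],
}
OTHER_LABEL = "OTH"
BODY_STRUCTURE = "123037004"  # generic "Body structure" class from the table


def labels_for_ancestors(ancestor_ids):
    """Return every specific label whose class appears among the ancestors.

    Assumption: `ancestor_ids` is a set of SCTID strings supplied by the
    caller. Falls back to OTH when only the generic body-structure class
    matches, and to an empty list when nothing matches.
    """
    matched = [
        label
        for label, sctids in SITE_CLASSES.items()
        if any(sctid in ancestor_ids for sctid in sctids)
    ]
    if matched:
        return matched
    return [OTHER_LABEL] if BODY_STRUCTURE in ancestor_ids else []
```

Returning all matching labels, rather than the first one, avoids assuming an order of precedence between classes, which the table does not state.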
Figure 5. State diagram for the diagnosis section, showing the state transitions between terms used in rule-based extraction.
Summary of results for recognition of different entity types.
| Type | Number of Entities | Recall | Precision | F-Score |
| Specimen Site | 737 | 0.89 | 0.86 | 0.87 |
| Histological type | 241 | 0.84 | 0.88 | 0.86 |
| Grade | 201 | 0.87 | 0.96 | 0.91 |
| Metastatic Status Indicator | 120 | 0.88 | 0.95 | 0.91 |
| Procedure | 257 | 0.97 | 0.88 | 0.90 |
Document level extraction of cancer characteristics.
| Type | Number of Entities | Recall | Precision | F-Score |
| Metastasis Site | 113 | 0.89 | 0.93 | 0.91 |
| Metastasis status | 103 | 0.84 | 0.88 | 0.86 |
| Lymph node metastasis | 31 | 0.84 | 0.87 | 0.85 |
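The F-Score column can be related to the recall and precision columns with a short calculation. The sketch below assumes the reported F-Score is the balanced F1 measure (the harmonic mean of precision and recall), which the record does not state explicitly; the function name `f_score` is illustrative.

```python
def f_score(recall, precision):
    """Balanced F1: harmonic mean of precision and recall."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs taken from the document-level table above.
DOCUMENT_LEVEL = {
    "Metastasis Site": (0.89, 0.93),
    "Metastasis status": (0.84, 0.88),
    "Lymph node metastasis": (0.84, 0.87),
}

for name, (recall, precision) in DOCUMENT_LEVEL.items():
    print(f"{name}: {f_score(recall, precision):.2f}")
# Metastasis Site: 0.91
# Metastasis status: 0.86
# Lymph node metastasis: 0.85
```

Under that assumption, the computed values agree with the F-Score column of the document-level table.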
Diagnosis section entity types.
| Entity | Definition | Example |
| Specimen Site | Body site, where the specimen was taken | Skin |
| Histologic Type | Morphologic type of the tumor | Adenocarcinoma |
| Grade | Tumor differentiation | Moderately differentiated |
| Metastatic status indicator | Phrases denoting a metastatic tumor | Metastatic |
| Procedure | Method for obtaining the tumor | Biopsy |
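The five entity types above can be represented as a small data structure attached to a specimen. This is an illustrative sketch only: the class names (`EntityType`, `Entity`, `Specimen`) and the helper `has_metastatic_indicator` are hypothetical, and treating the mere presence of an indicator phrase as a signal is a simplification, not the rule set described in the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntityType(Enum):
    # Entity types as listed in the table above.
    SPECIMEN_SITE = "Specimen Site"
    HISTOLOGIC_TYPE = "Histologic Type"
    GRADE = "Grade"
    METASTATIC_STATUS_INDICATOR = "Metastatic status indicator"
    PROCEDURE = "Procedure"


@dataclass
class Entity:
    type: EntityType
    text: str


@dataclass
class Specimen:
    """Hypothetical container grouping the entities found for one specimen."""
    entities: list = field(default_factory=list)

    def texts(self, entity_type):
        """Return the surface strings of all entities of the given type."""
        return [e.text for e in self.entities if e.type is entity_type]

    def has_metastatic_indicator(self):
        # Simplifying assumption: any indicator phrase counts as a signal.
        return bool(self.texts(EntityType.METASTATIC_STATUS_INDICATOR))


# Example built from the "Example" column of the table.
example = Specimen([
    Entity(EntityType.SPECIMEN_SITE, "Skin"),
    Entity(EntityType.HISTOLOGIC_TYPE, "Adenocarcinoma"),
    Entity(EntityType.GRADE, "Moderately differentiated"),
    Entity(EntityType.METASTATIC_STATUS_INDICATOR, "Metastatic"),
    Entity(EntityType.PROCEDURE, "Biopsy"),
])
```

Here `example.texts(EntityType.SPECIMEN_SITE)` yields the specimen site strings, and `example.has_metastatic_indicator()` reports whether any indicator phrase was attached to the specimen.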