Robert Reihs, Heimo Müller, Stefan Sauer, Kurt Zatloukal.
Abstract
This paper presents an automatic classification system for pathological findings. The starting point was a pathology tissue collection of about 1.4 million tissue samples described by free-text records spanning 23 years. Extracting knowledge from this "big data" pool is a challenging task, especially because the unstructured data cover many years. The classification is based on ontology-based term extraction and on decision trees built with a manually curated classification system. The information extraction system is based on regular expressions and a text substitution system. We describe how medical experts generate the decision trees using a visual editor, and we evaluate the classification process against a reference data set. We achieved an F-Score of 89,7% for ICD-10 and an F-Score of 94,7% for ICD-O classification. For the information extraction of tumor staging and receptors we achieved F-Scores ranging from 81,8% to 96,8%.
Keywords: Automatic classification; Biobank; Decision Trees; Text mining
Year: 2016 PMID: 28344914 PMCID: PMC5346425 DOI: 10.1007/s12553-016-0169-8
Source DB: PubMed Journal: Health Technol (Berl) ISSN: 2190-7196
Fig. 1 104 classification trees with an overall number of 4285 nodes. Created with Cytoscape.
Fig. 2 Classification tree for mamma carcinoma. Created with Photoshop.
Fig. 3 A single branch of a decision tree, with nodes showing the rules for the tree whose entry point is the synonym of “Gastrinoma”. Created with Photoshop.
Precision, recall and F-Score values for the classification of tumor staging
| | Precision (FM original) | Recall (FM original) | F-Score (FM original) | Precision (FM corrected) | Recall (FM corrected) | F-Score (FM corrected) | Precision (Classification) | Recall (Classification) | F-Score (Classification) |
|---|---|---|---|---|---|---|---|---|---|
| T Staging | 74,4% | 41,9% | 53,6% | 55,6% | 43,1% | 48,6% | 96,2% | 93,8% | 95,0% |
| N Staging | 70,6% | 26,5% | 38,5% | 40,6% | 39,7% | 40,1% | 84,3% | 86,8% | 85,5% |
| M Staging | 60,0% | 12,4% | 20,5% | 47,2% | 52,6% | 49,8% | 81,2% | 84,5% | 82,8% |
| G Grading | 81,6% | 34,6% | 48,6% | 77,6% | 35,9% | 49,1% | 94,6% | 99,1% | 96,8% |
| R Staging | 61,1% | 10,0% | 17,2% | 21,0% | 23,6% | 22,2% | 77,0% | 94,5% | 84,9% |
| L Staging | 62,5% | 22,7% | 33,3% | 14,3% | 27,3% | 18,8% | 81,8% | 81,8% | 81,8% |
| V Staging | 100% | 6,7% | 12,5% | 100% | 6,7% | 12,5% | 100% | 86,6% | 92,9% |
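The F-Score values in the table above are consistent with the balanced F-measure, the harmonic mean of precision and recall. The source does not spell out the formula, so treating it as F1 is an assumption; the short sketch below recomputes the "Classification" F-Scores from the published precision and recall values. The function name `f_score` is chosen here for illustration.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Inputs and output are fractions in [0, 1]. Assumes the table uses F1.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs from the "Classification" columns of the table,
# written as fractions with decimal points.
classification = {
    "T": (0.962, 0.938),
    "N": (0.843, 0.868),
    "M": (0.812, 0.845),
    "G": (0.946, 0.991),
    "R": (0.770, 0.945),
    "L": (0.818, 0.818),
    "V": (1.000, 0.866),
}

for stage, (p, r) in classification.items():
    print(f"{stage}: F = {f_score(p, r):.1%}")
```

Because the published precision and recall values are already rounded, a recomputed F-Score can differ from the table in the last digit.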
Input for data cleanup
| Patient name | Date of birth | Date of diagnosis | Diagnosis |
|---|---|---|---|
| Graller Violetta | 03.02.1985 | 14.05.1999 | LOW DIFFERENTIATED INVASIVE DUCTAL MAMMA CACINOMA (NOS, 3,5 CM MAX: DIAMETER, MINIMAL RESECTION DISTANCE AFTER BASAL 5 MM). PT 2 |
| Graller Violeta | 03.02.1985 | 17.05.1999 | ESTROGENRECEPTOR: STRONG (SCORE 12) PROGESTERONRECEPTOR: STRONG |
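The two input rows above differ in the spelling of the patient name, and the first diagnosis contains a misspelled term. The source does not state how the cleanup module matches records or corrects spelling, so the following is only a minimal sketch of one possible approach using Python's standard `difflib`. The vocabulary, the similarity thresholds, the use of the date of birth as a matching key, and all function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical vocabulary; a real system would use a curated medical dictionary.
VOCABULARY = ["CARCINOMA", "MAMMA", "DUCTAL", "INVASIVE", "DIFFERENTIATED"]


def correct_spelling(text: str, cutoff: float = 0.85) -> str:
    """Replace each token by its closest vocabulary entry if it is similar enough."""
    corrected = []
    for token in text.split():
        match = get_close_matches(token, VOCABULARY, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else token)
    return " ".join(corrected)


def same_patient(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Assume two records belong together if birth dates match and names are near-identical."""
    if a["birth"] != b["birth"]:
        return False
    return SequenceMatcher(None, a["name"], b["name"]).ratio() >= threshold


def pool_findings(records: list) -> list:
    """Spell-correct each diagnosis and merge records judged to be the same patient."""
    pooled = []
    for rec in records:
        text = correct_spelling(rec["diagnosis"])
        for target in pooled:
            if same_patient(target, rec):
                target["diagnosis"] += " " + text
                break
        else:  # no matching patient found: start a new pooled record
            pooled.append({**rec, "diagnosis": text})
    return pooled


records = [
    {"name": "Graller Violetta", "birth": "03.02.1985",
     "diagnosis": "LOW DIFFERENTIATED INVASIVE DUCTAL MAMMA CACINOMA"},
    {"name": "Graller Violeta", "birth": "03.02.1985",
     "diagnosis": "ESTROGENRECEPTOR: STRONG (SCORE 12) PROGESTERONRECEPTOR: STRONG"},
]

for patient in pool_findings(records):
    print(patient["name"], "|", patient["diagnosis"])
```

The diagnosis strings here are shortened excerpts of the table rows. Fuzzy matching on names alone would be risky in practice, which is why the sketch also requires an identical date of birth.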
Result data stored in the data warehouse
| Result Data Cleanup | |||
|---|---|---|---|
| Patient name | Date of birth | Date of diagnosis | Diagnosis |
| Graller Violetta | 03.02.1985 | 14.05.1999 | LOW DIFFERENTIATED INVASIVE DUCTAL MAMMA CARCINOMA (NOS, 3,5 CM MAX: DIAMETER, MINIMAL RESECTION DISTANCE AFTER BASAL 5 MM). PT 2 |
| Result Information Extraction | |||
| Diagnosis (preprocessing) | (MAMMA) LOW DIFFERENTIATED INVASIVE DUCTAL MAMMA CARCINOMA (NOS, 3,5 CM MAX: DIAMETER, MINIMAL RESECTION DISTANCE AFTER BASAL 5 MM). PT 2 | ||
| | 15 METASTASES FREE LYMPHNODES. NON-INVASIVE CARCINOMA COLLECTIONS IN THE SINUS LACTIFERI. RESECTIONBOUNDARIES FREE OF CANCER R-0. GRADING 3, G-3, N 0; ESTROGENRECEPTOR: STRONG (SCORE 12) PROGESTERONRECEPTOR: STRONG | | |
| Data stored | T = 2 | G = 3 | R = 0 |
| | ESTROGENRECEPTOR = (STRONG, 12) | PROGESTERONRECEPTOR = (STRONG, 9*(auto assigned)) | |
| Result Classification | |||
| ICD-10 = C50.9 | Malignant neoplasm of breast of unspecified site | ICD-O = 8500/3 | Infiltrating duct carcinoma, NOS |
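The abstract states that the information extraction is based on regular expressions. The actual rules are not given here, so the patterns below are hypothetical; they are a minimal sketch of how the stored values T, G, R and the receptor status could be pulled from the preprocessed diagnosis text shown in the table above. The rule that auto-assigns a score when none is written (the value 9 for the progesterone receptor in the table) is not described in this section, so the sketch leaves a missing score as `None`.

```python
import re

# Hypothetical patterns, not the rules of the original system.
STAGE_PATTERNS = {
    "T": re.compile(r"\bP?T[\s-]?([0-4])\b"),  # e.g. "PT 2"
    "G": re.compile(r"\bG[\s-]?([1-4])\b"),    # e.g. "G-3"
    "R": re.compile(r"\bR[\s-]?([0-2])\b"),    # e.g. "R-0"
}
RECEPTOR_PATTERN = re.compile(
    r"\b(ESTROGEN|PROGESTERON)RECEPTOR:\s*(\w+)(?:\s*\(SCORE\s*(\d+)\))?"
)


def extract(text: str) -> dict:
    """Return the first match per staging pattern plus all receptor findings."""
    result = {}
    for key, pattern in STAGE_PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[key] = int(match.group(1))
    for name, strength, score in RECEPTOR_PATTERN.findall(text):
        # An absent "(SCORE n)" yields an empty string; store it as None.
        result[name + "RECEPTOR"] = (strength, int(score) if score else None)
    return result


# Preprocessed diagnosis text from the table above, joined into one string.
TEXT = (
    "(MAMMA) LOW DIFFERENTIATED INVASIVE DUCTAL MAMMA CARCINOMA (NOS, 3,5 CM "
    "MAX: DIAMETER, MINIMAL RESECTION DISTANCE AFTER BASAL 5 MM). PT 2 "
    "15 METASTASES FREE LYMPHNODES. NON-INVASIVE CARCINOMA COLLECTIONS IN THE "
    "SINUS LACTIFERI. RESECTIONBOUNDARIES FREE OF CANCER R-0. GRADING 3, G-3, "
    "N 0; ESTROGENRECEPTOR: STRONG (SCORE 12) PROGESTERONRECEPTOR: STRONG"
)

print(extract(TEXT))
```

Word boundaries keep the single-letter patterns from firing inside longer words such as "RESECTION" or "GRADING"; a production rule set would need many more spelling variants than this sketch covers.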
Comparison of the mamma carcinoma distribution found by the text mining tool with pathology textbooks and statistical reports
| | Text mining Tool | W. Remmele, Pathologie | Harris JR, Diseases of the Breast | Böcker/Denk/Heitz, Pathologie | NCI, Cancer Statistics Review |
|---|---|---|---|---|---|
| Ductal Ca | 78,9% | 67,9% | 65–80% | ca. 80% | 67,6% |
| Lobular Ca | 11,2% | 6,3% | 5–10% | 10–20% | 8,0% |
| Medullary Ca | 2,0% | 2,8% | <5% | <1% | 0,7% |
| Mucinous Ca | 3,6% | 2,2% | <2% | 2% | 2,5% |
| Tubular Ca | 2,4% | 0,7% | 1% | 1–2% | 1,6% |
| Papillary Ca | 1,9% | 0,9% | <2% | <1% | 0,6% |
Module run time, measured on a dual-core system with 1,67 GHz and 2 GB RAM running Windows XP
| Number of Findings | Pool together | Spell correction | DB update | Initializing dictionary | Text Mining | DB update |
|---|---|---|---|---|---|---|
| 1. run/1 k | 0,7–0,9 sec | 0,4–0,6 sec | 0,6–1,3 sec | 20–30 sec | 0,3–0,6 sec | 0,5–1,4 sec |
| 2. runs/1 k | 0,7–0,9 sec | 0,4–0,6 sec | 0–1,3 sec | 20–30 sec | 0,3–0,6 sec | 0–1,4 sec |
| 1. run/10 k | 7–8,5 sec | 4–5 sec | 5,2–12,6 sec | 20–30 sec | 3,5–7,3 sec | 5–12,8 sec |
| 2. runs/10 k | 7–8,5 sec | 4–5 sec | 0–12,6 sec | 20–30 sec | 3,5–7,3 sec | 0–12,8 sec |
Fig. 4 Overall survival of patients operated on for a malignant neoplasm of the stomach, by tumor staging. Created with WebTool MedicalExplorer.
Fig. 5 Survival of colon cancer patients diagnosed between 1985 and 1987 and between 2000 and 2003. Created with R.