Jeongeun Lee1, Hyun-Je Song2, Eunsil Yoon3, Seong-Bae Park2, Sung-Hye Park4, Jeong-Wook Seo4, Peom Park5, Jinwook Choi6,7.
Abstract
BACKGROUND: Pathology reports are written in free-text form, which precludes efficient data gathering. We aimed to overcome this limitation and design an automated system for extracting biomarker profiles from accumulated pathology reports.
Keywords: Biomarkers; Cancer disease knowledge representation model; Clinical decision-making; Natural language processing; Pathology reports
Year: 2018 PMID: 29783980 PMCID: PMC5963015 DOI: 10.1186/s12911-018-0609-7
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Representation model for biomarker information extraction. a Reporting system for multiple biopsy samples from one patient. Because the same biomarker test can be conducted on tissue slide #1 and tissue slide #2 of the patient, an IHC report can contain multiple "slide paragraphs" (TS_P). Also, multiple pathologic findings derived from multiple IHC tests can be reported in one SP report (b) or in separate SP reports. c Representation model for extracting biomarker information from pathology reports at Seoul National University Hospital. IHC: immunohistochemistry, SP: surgical pathology, BN: biomarker name
Fig. 2 Flow chart of information extraction from an IHC report, with an example. Step #1) Classify the type of IHC report and choose the appropriate parser for the input. Step #2) Normalize the biomarker names recognized in step #1 using the BN dictionary. Step #3) Normalize the test results recognized in step #1 using the TR dictionary. BN: biomarker name, TR: test result
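
Steps #2 and #3 of the flow chart amount to dictionary lookups, which can be sketched roughly as follows. This is a minimal illustration only: the dictionary entries, the names `BN_DICTIONARY`, `TR_DICTIONARY`, and `normalize_pair`, and the choice to return `None` for unknown terms are all assumptions, not the dictionaries or code used at SNUH. The report classification and parsing of step #1 is not modeled here.

```python
from typing import Optional, Tuple

# Hypothetical dictionary entries for illustration; the real BN and TR
# dictionaries are not reproduced in this record.
BN_DICTIONARY = {
    "cd10": "CD10",
    "cd-10": "CD10",
    "desmin": "Desmin",
}
TR_DICTIONARY = {
    "positive": "positive",
    "pos": "positive",
    "+": "positive",
    "negative": "negative",
    "neg": "negative",
    "-": "negative",
}


def normalize_pair(raw_bn: str, raw_tr: str) -> Tuple[Optional[str], Optional[str]]:
    """Map a recognized (biomarker name, test result) pair to canonical forms.

    Returns None for a component that is missing from its dictionary, so
    that unknown terms can be flagged for review instead of being guessed.
    """
    bn = BN_DICTIONARY.get(raw_bn.strip().lower())
    tr = TR_DICTIONARY.get(raw_tr.strip().lower())
    return bn, tr
```

For example, `normalize_pair("CD10", "Pos")` yields `("CD10", "positive")`, while a biomarker name absent from the dictionary comes back as `None`.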
Term expansion rules for biomarker names
| Rule | Example |
|---|---|
| 1) Replace Roman numerals with the corresponding Arabic numerals | Factor VIII → Factor 8 |
| 2) Replace Greek symbols with the corresponding letters | CD79 α → CD79 alpha |
| 3) Remove laboratory-related terms | CD10(repeat) → CD10 |
| 4) Generate new variants | GAL3 = GAL-3 = GAL 3 |
| 5) Convert to lowercase | DESMIN → desmin |
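
The five rules in the table can be approximated with a few string operations. The sketch below is one possible reading, not the authors' implementation: the lookup tables `ROMAN_TO_ARABIC` and `GREEK_TO_LATIN`, the `LAB_TERMS` pattern (which only covers the "(repeat)" example), the function name `expand_term`, and the order in which the rules are applied are all assumptions.

```python
import re
from typing import Set

# Illustrative lookup tables (assumptions; the source lists only one example per rule).
ROMAN_TO_ARABIC = {
    "I": "1", "II": "2", "III": "3", "IV": "4", "V": "5",
    "VI": "6", "VII": "7", "VIII": "8", "IX": "9", "X": "10",
}
GREEK_TO_LATIN = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}
# Only the "(repeat)" marker from the table is covered here.
LAB_TERMS = re.compile(r"\s*\(repeat\)", re.IGNORECASE)


def expand_term(name: str) -> Set[str]:
    """Return a set of normalized variants for one biomarker name."""
    # Rule 3: remove laboratory-related terms.
    text = LAB_TERMS.sub("", name).strip()
    # Rule 2: replace Greek symbols with their spelled-out names.
    for symbol, word in GREEK_TO_LATIN.items():
        text = text.replace(symbol, word)
    # Rule 1: replace Roman numerals that appear as standalone tokens.
    text = " ".join(ROMAN_TO_ARABIC.get(token, token) for token in text.split())
    # Rule 5: convert to lowercase.
    text = text.lower()
    # Rule 4: generate joined, hyphenated, and spaced variants for
    # names of the form <letters><optional separator><digits>.
    variants = {text}
    match = re.fullmatch(r"([a-z]+)[- ]?(\d+)", text)
    if match:
        head, number = match.groups()
        variants |= {f"{head}{number}", f"{head}-{number}", f"{head} {number}"}
    return variants
```

Under these assumptions, `expand_term("GAL3")`, `expand_term("GAL-3")`, and `expand_term("GAL 3")` all produce the same set, so any of the three spellings can be matched against the same dictionary entry. Treating every standalone "I" or "V" as a Roman numeral is a simplification that a real system would need to guard against.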
Fig. 3 a Excerpt of an SP report. b Corresponding parse tree of the SP report in a
Statistics of SNUH dataset
| Type of report | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | Total |
|---|---|---|---|---|---|---|---|
| IHC | 0 | 6370 | 7611 | 8067 | 10,889 | 8830 | 41,772 |
| SP | 160 | 6338 | 7507 | 7893 | 10,341 | 8280 | 40,519 |
IHC Immunohistochemistry reports, SP Surgical pathology reports
Extraction performance for IHC and SP
A. Recognition and normalization

| Item | Recognition recall | Recognition precision | Recognition F1 | Normalization recall | Normalization precision | Normalization F1 |
|---|---|---|---|---|---|---|
| TS_ID | 1 | 1 | 1 | – | – | – |
| BN | 0.999 | 1 | 0.999 | 1.000 | 0.946 | 0.972 |
| TR | 0.998 | 0.998 | 0.998 | 1.000 | 0.939 | 0.969 |

B. Exact matching

| Item | Recall | Precision | F1 |
|---|---|---|---|
| Organ | 0.896 | 0.953 | 0.924 |
| Diagnosis | 0.794 | 0.427 | 0.556 |

C. Overlap matching

| Item | Recall | Precision | F1 |
|---|---|---|---|
| Organ | 0.901 | 0.961 | 0.930 |
| Diagnosis | 0.794 | 0.754 | 0.773 |

TS_ID Tissue slide ID, BN Biomarker name, TR Test result
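
The F1 columns are consistent with the standard harmonic mean of recall and precision, which the short sketch below recomputes from the reported values. The function name `f1_score` is illustrative, and the assumption that the authors used this standard definition is inferred from the numbers rather than stated in this record.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision (standard F1 definition)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Recompute two table entries from their reported recall and precision.
organ_exact = round(f1_score(0.896, 0.953), 3)        # Organ, exact matching
diagnosis_overlap = round(f1_score(0.794, 0.754), 3)  # Diagnosis, overlap matching
print(organ_exact, diagnosis_overlap)
```

Because the published recall and precision are themselves rounded to three decimals, a recomputed F1 can differ from the table in the last digit; the Diagnosis row under exact matching is one such case.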
Fig. 4 The process of parsing SP reports. The main graph shows statistics for total BNs, and the corner graph shows statistics for total SP reports
Fig. 5 Comparison of the positive rates of biomarker assays obtained with our proposed information extraction approach and those provided by Pathpedia. a Statistics provided by Pathpedia. b Statistics provided by our proposed approach for adenocarcinoma (diffuse type and intestinal type) of the stomach
Fig. 6 The creation rate of new BN term variants per year. Although a similar number of new IHC tests (described as BN-TR pairs) were performed each year at SNUH, the number of new biomarker name (BN) variants increased steadily every year. This increase is partially related to the increase in new types of biomarkers (described as P_BN) analyzed at SNUH