Yoojoong Kim, Jeong Hyeon Lee, Sunho Choi, Jeong Moon Lee, Jong-Ho Kim, Junhee Seok, Hyung Joon Joo.
Abstract
Pathology reports contain essential data for both clinical and research purposes. However, extracting meaningful, qualitative data from the original documents is difficult because the reports are narrative and complex. Keyword extraction is needed to summarize the informative text and reduce the time required to process it. In this study, we employed a deep learning model for natural language processing to extract keywords from pathology reports and present a supervised keyword extraction algorithm. We considered three types of pathological keywords: specimen, procedure, and pathology. We compared the performance of this algorithm with conventional keyword extraction methods on 3115 pathology reports that were manually labeled by professional pathologists. Additionally, we applied the algorithm to 36,014 unlabeled pathology reports and analysed the extracted keywords with biomedical vocabulary sets. The results demonstrated the suitability of our model for practical application in extracting important data from pathology reports.
Year: 2020 PMID: 33219276 PMCID: PMC7679382 DOI: 10.1038/s41598-020-77258-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Number of unique keywords in the pathology report dataset.
| Dataset | Type | Unique keywords |
|---|---|---|
| Training | Specimen | 619 |
| Training | Procedure | 227 |
| Training | Pathology | 733 |
| Test | Specimen | 200 |
| Test | Procedure | 106 |
| Test | Pathology | 202 |
Figure 1. Fine-tuning for the keyword extraction of pathology reports. (A) Cross-entropy loss on the training and test sets according to the training step. (B) F1 score on the test set according to the training step.
Figure 2. Exact matching for the three types of pathological keywords according to the training step.
Figure 3. Exact matching rate for the three types of pathological keywords according to the number of samples used to train the Bidirectional Encoder Representations from Transformers (BERT) model. (A) Specimen type. (B) Procedure type. (C) Pathology type.
Summary of keyword extraction performance for pathology reports.
| Method | Precision (SPE) | Precision (PRO) | Precision (PAT) | Recall (SPE) | Recall (PRO) | Recall (PAT) | Exact matching (SPE) | Exact matching (PRO) | Exact matching (PAT) |
|---|---|---|---|---|---|---|---|---|---|
| BERT | 0.9951 | 0.9985 | 0.9961 | 0.9962 | 0.9990 | 0.9938 | 0.9839 | 0.9956 | 0.9795 |
| LSTM | 0.9871 | 0.9932 | 0.9438 | 0.9764 | 0.9919 | 0.9387 | 0.9327 | 0.9868 | 0.9151 |
| Pre-trained LSTM | 0.9940 | 0.9978 | 0.9924 | 0.9915 | 0.9979 | 0.9934 | 0.9646 | 0.9794 | 0.9631 |
| CNN | 0.9740 | 0.9769 | 0.9320 | 0.9716 | 0.9758 | 0.9204 | 0.9327 | 0.9502 | 0.8770 |
| Pre-trained CNN | 0.9947 | 0.9958 | 0.9855 | 0.9903 | 0.9964 | 0.9823 | 0.9631 | 0.9690 | 0.9218 |
| Bayes Classifier | 0.9300 | 0.9601 | 0.8956 | 0.8946 | 0.9775 | 0.8227 | 0.7130 | 0.9078 | 0.5168 |
| Kea | 0.7321 | 0.1154 | 0.3499 | 0.3751 | 0.1076 | 0.1198 | 0.1010 | 0.0981 | 0.0190 |
| WINGNUS | 0.6227 | 0.1786 | 0.1552 | 0.3904 | 0.1650 | 0.1017 | 0.1098 | 0.1552 | 0.0835 |
SPE represents specimen type, PRO represents procedure type, and PAT represents pathology type.
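The three metric columns above can be illustrated with a small sketch. It assumes that precision and recall are computed over keyword tokens pooled across reports, and that exact matching counts a report as correct only when the predicted keyword set equals the labeled set. This record does not define the metrics, so these definitions, the function name `score_reports`, and the toy data are assumptions rather than the authors' evaluation code.

```python
from typing import List, Set, Tuple


def score_reports(
    predicted: List[Set[str]], gold: List[Set[str]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, exact_match) over paired reports.

    Assumed definitions (not taken from the source):
    - precision = correctly predicted tokens / all predicted tokens
    - recall    = correctly predicted tokens / all gold tokens
    - exact     = fraction of reports whose predicted set equals the gold set
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")

    true_pos = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    n_exact = sum(1 for p, g in zip(predicted, gold) if p == g)

    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    exact = n_exact / len(gold) if gold else 0.0
    return precision, recall, exact


# Toy specimen-type keywords for three hypothetical reports.
pred = [{"stomach"}, {"colon", "polyp"}, {"liver"}]
gold = [{"stomach"}, {"colon"}, {"lung"}]
precision, recall, exact = score_reports(pred, gold)
```

Under these assumed definitions, one spurious or missing token fails the whole report for exact matching. That would be consistent with exact matching falling well below precision and recall for some methods in the table.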
Running times for model training.
| Method | Training time (s) |
|---|---|
| BERT (1 epoch) | 19.0 |
| LSTM (1 epoch) | 127.8 |
| CNN (1 epoch) | 32.0 |
| Bayes Classifier | 2.4 |
| Kea | 13,081.1 |
| WINGNUS | 10,815.5 |
Unique set of extracted keywords from unlabeled pathology reports.
| Type | # |
|---|---|
| Specimen | 3052 |
| Pathology | 3475 |
| Specimen + Pathology | 9084 |
| Procedure | 797 |
Figure 4. Distribution of the maximum word similarity between each extracted keyword and the existing pathology vocabulary. (A) Specimen + Pathology type and Medical Subject Headings (MeSH). (B) Procedure type and MeSH. (C) Specimen + Pathology type and North American Association of Central Cancer Registries (NAACCR). (D) Procedure type and NAACCR.
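The quantity plotted in Figure 4, the maximum similarity between an extracted keyword and any term in a reference vocabulary, can be sketched as follows. This record does not state which similarity measure was used, so the character-based ratio from Python's standard `difflib` is swapped in purely as a stand-in. The function name `max_similarity` and the example terms are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable


def max_similarity(
    keywords: Iterable[str], vocabulary: Iterable[str]
) -> Dict[str, float]:
    """Map each keyword to its highest similarity against the vocabulary.

    Similarity here is difflib's character-level ratio in [0, 1]; this is an
    assumed stand-in, not the measure used in the paper.
    """
    vocab = [term.lower() for term in vocabulary]
    if not vocab:
        raise ValueError("vocabulary must not be empty")
    return {
        kw: max(SequenceMatcher(None, kw.lower(), term).ratio() for term in vocab)
        for kw in keywords
    }


# Hypothetical extracted keywords and a tiny stand-in vocabulary.
scores = max_similarity(
    ["stomach", "colonoscopic biopsy"],
    ["Stomach", "Biopsy", "Colonoscopy"],
)
```

A keyword that appears verbatim in the vocabulary scores 1.0, and a histogram of these per-keyword maxima would give a distribution of the kind the caption describes.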
Figure 5. Keyword extraction algorithm based on Bidirectional Encoder Representations from Transformers for pathology reports. SPE represents specimen type, PRO represents procedure type, and PAT represents pathology type.
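The last step of such a pipeline, turning per-token labels into the three keyword lists, might look like the sketch below. It assumes the model emits one label per token from SPE, PRO, PAT, or a non-keyword label written here as `O`. That label, the function name `group_keywords`, and the example report are illustrative assumptions, and the model's forward pass is omitted.

```python
from typing import Dict, List, Sequence

KEYWORD_TYPES = ("SPE", "PRO", "PAT")


def group_keywords(
    tokens: Sequence[str], labels: Sequence[str]
) -> Dict[str, List[str]]:
    """Collect tokens under their predicted keyword type, preserving order.

    Tokens with any label outside KEYWORD_TYPES (e.g. the assumed "O") are
    treated as non-keywords and dropped.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    grouped: Dict[str, List[str]] = {t: [] for t in KEYWORD_TYPES}
    for token, label in zip(tokens, labels):
        if label in grouped:
            grouped[label].append(token)
    return grouped


# Hypothetical tokenized report line with per-token predictions.
tokens = ["stomach", ",", "endoscopic", "biopsy", ":", "adenocarcinoma"]
labels = ["SPE", "O", "PRO", "PRO", "O", "PAT"]
result = group_keywords(tokens, labels)
```

Keeping the grouping separate from the model makes it possible to apply the same post-processing to the output of any per-token classifier.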