Sicheng Zhou, Nan Wang, Liwei Wang, Hongfang Liu, Rui Zhang.
Abstract
OBJECTIVE: Accurate extraction of breast cancer patients' phenotypes is important for clinical decision support and clinical research. This study developed and evaluated cancer domain pretrained CancerBERT models for extracting breast cancer phenotypes from clinical texts. We also investigated the effect of customized cancer-related vocabulary on the performance of CancerBERT models.
Keywords: CancerBERT; cancer phenotyping; electronic health record; name entity recognition; natural language processing
Year: 2022 PMID: 35333345 PMCID: PMC9196678 DOI: 10.1093/jamia/ocac040
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 7.942
Figure 1. Examples of annotation in INCEpTION.
Breast cancer phenotypes, their potential values, and examples in clinical texts
| Phenotypes | Values | Examples of descriptions in clinical text |
|---|---|---|
| Hormone receptor type | Positive, negative | HER2 gene was amplified; estrogen receptor: positive (95%, strong staining); tumor is PR negative (0% staining) |
| Tumor size | Numeric values describe volumes | Tumor size: 1.0×0.5×0.7 cm |
| Tumor site | Description of positions | Tumor is at 12 o’clock position and 2 cm from the nipple. |
| Cancer grade | Numerical values: (1–3) | Histologic grade: 1 of 3; Sample shows Nottingham grade 2 lesions |
| Histological type | Ductal carcinoma in situ (DCIS); lobular carcinoma in situ (LCIS), etc. | Histologic type of invasive carcinoma: ductal carcinoma in situ |
| Tumor laterality | Right, left | Specimen laterality: right breast; |
| Cancer stage | TNM staging: TX, Tis, T1-4; NX, N0, N1-3; M0, M1 | Pathologic stage is pT4 NX MX |
HER2: human epidermal growth factor receptor 2.
Annotation statistics
| Category | Item | Total number | Total unique entities |
|---|---|---|---|
| Annotated statistics | Documents | 200 | NA |
| | Total sentences | 9685 | NA |
| | Total tokens | 221 356 | NA |
| Name entity statistics | Hormone receptor type | 1673 | 29 |
| | Hormone receptor status | 436 | 14 |
| | Tumor size | 540 | 305 |
| | Tumor site | 329 | 173 |
| | Cancer grade | 271 | 15 |
| | Tumor laterality | 1192 | 4 |
| | Cancer stage | 173 | 38 |
| | Histological type | 1070 | 95 |
Figure 2. The training process of CancerBERT models. The CancerBERT models were pretrained based on the BlueBERT model. The process in the red box was implemented in this study. BERT: bidirectional encoder representations from transformers.
BERT fine-tuning NER entity-level evaluation by exact match F1 score (lenient match F1 scores are shown in parentheses)
| Entity type | BiLSTM-CRF | BERT-large origin | BlueBERT (PubMed + MIMIC III) | BioBERT (PubMed) | CharBERT (Wiki) | Character-BERT (Medical) | CancerBERTOrigVoc (EHRs corpus) | CancerBERTCustVoc_997 (EHRs corpus) | CancerBERTCustVoc_397 (EHRs corpus) |
|---|---|---|---|---|---|---|---|---|---|
| Hormone receptor type | 0.953 (0.957) | 0.976 (0.985) | 0.979 (0.984) | 0.982 (0.987) | 0.982 (…) | 0.972 (0.983) | … | 0.979 (0.985) | 0.982 (0.985) |
| Hormone receptor status | 0.856 (0.856) | 0.846 (0.846) | 0.885 (0.885) | 0.859 (0.859) | 0.878 (0.878) | 0.851 (0.851) | … | 0.887 (0.887) | 0.891 (0.891) |
| Tumor size | 0.664 (0.709) | 0.663 (0.767) | 0.781 (0.819) | … | 0.727 (0.797) | 0.674 (0.684) | 0.765 (0.813) | 0.784 (0.824) | 0.781 (…) |
| Tumor site | 0.562 (0.771) | 0.696 (0.769) | 0.711 (0.797) | 0.749 (0.799) | … | 0.688 (0.762) | 0.733 (0.792) | 0.715 (0.787) | 0.727 (…) |
| Cancer grade | 0.910 (0.910) | 0.857 (0.857) | 0.891 (0.891) | 0.886 (0.886) | 0.856 (0.856) | 0.833 (0.833) | 0.891 (0.891) | 0.898 (0.898) | … |
| Tumor laterality | 0.935 (0.935) | 0.926 (0.926) | 0.931 (0.931) | 0.943 (0.943) | 0.948 (0.948) | 0.934 (0.934) | 0.939 (0.939) | 0.947 (0.947) | … |
| Cancer stage | 0.908 (0.908) | 0.804 (0.804) | 0.870 (0.870) | 0.869 (0.869) | … | 0.907 (0.907) | 0.870 (0.870) | 0.885 (0.885) | 0.898 (0.898) |
| Histological type | … | 0.823 (0.918) | 0.843 (0.922) | 0.855 (0.934) | 0.850 (0.927) | 0.861 (…) | 0.849 (0.922) | 0.862 (0.937) | 0.862 (0.938) |
| Macro average | 0.834 (0.873) | 0.824 (0.859) | 0.862 (0.887) | 0.868 (0.889) | 0.864 (0.888) | 0.840 (0.862) | 0.867 (0.889) | 0.871 (0.896) | … |
| Micro average | 0.876 (0.905) | 0.873 (0.907) | 0.898 (0.921) | 0.904 (0.926) | 0.899 (0.923) | 0.883 (0.906) | 0.903 (0.925) | 0.906 (0.930) | … |
Note. The scores were averaged scores based on 10 runs. The numbers in bold indicate the highest score.
BERT: bidirectional encoder representations from transformers; BiLSTM: bidirectional long short-term memory; EHRs: electronic health record systems; NER: named entity recognition.
Indicates statistically higher than other methods (CI: 0.95).
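
The difference between exact and lenient match scoring in the table above can be illustrated with a minimal Python sketch. It assumes that an exact match requires identical entity type and span boundaries, and that a lenient match only requires the same entity type with overlapping spans; the page does not state the study's precise matching rules, so this definition, the `Entity` tuple layout, and the function names are illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical representation: (entity type, start token, end token exclusive).
Entity = Tuple[str, int, int]


def spans_overlap(a: Entity, b: Entity) -> bool:
    """True when both entities share a type and their token spans intersect."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def entity_f1(gold: List[Entity], pred: List[Entity], lenient: bool = False) -> float:
    """Entity-level F1 under exact matching (default) or lenient matching."""
    if lenient:
        # A prediction counts if it overlaps any gold entity of the same type,
        # and a gold entity counts as found if any prediction overlaps it.
        matched_pred = sum(any(spans_overlap(p, g) for g in gold) for p in pred)
        matched_gold = sum(any(spans_overlap(g, p) for p in pred) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(pred)
        matched_pred = sum(p in gold_set for p in pred)
        matched_gold = sum(g in pred_set for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the predicted tumor size span is shorter than the annotated one.
gold = [("Tumor size", 3, 8), ("Tumor laterality", 10, 11)]
pred = [("Tumor size", 3, 6), ("Tumor laterality", 10, 11)]
exact_score = entity_f1(gold, pred)
lenient_score = entity_f1(gold, pred, lenient=True)
```

In this toy case the truncated tumor size span is a miss under exact matching but a hit under lenient matching. That is the same pattern as in the table, where the two scores differ for entity types such as tumor size, tumor site, and histological type.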
Coverage of unique annotated tokens for different BERT vocabularies stated as token count (percentage of total number of unique annotated tokens)
| Entity type | Total number of unique annotated tokens | Exist in original BERT vocabulary | Exist in customized BERT vocabulary based on frequency | Exist in customized BERT vocabulary based on domain knowledge |
|---|---|---|---|---|
| Hormone receptor type | 33 | 14 (42.4%) | 22 (66.7%) | 26 (78.8%) |
| Hormone receptor status | 11 | 4 (36.4%) | 6 (54.5%) | 8 (72.7%) |
| Tumor size | 160 | 62 (38.7%) | 62 (38.7%) | 62 (38.7%) |
| Tumor site | 146 | 88 (60.3%) | 95 (65.1%) | 95 (65.1%) |
| Cancer grade | 20 | 15 (75.0%) | 15 (75.0%) | 18 (90.0%) |
| Tumor laterality | 10 | 4 (40.0%) | 6 (60.0%) | 8 (80.0%) |
| Cancer stage | 58 | 12 (20.7%) | 18 (31.0%) | 52 (89.7%) |
| Histological type | 72 | 28 (38.9%) | 53 (73.6%) | 58 (80.6%) |
| Total | 426 | 178 (41.8%) | 227 (53.3%) | 274 (64.3%) |
BERT: bidirectional encoder representations from transformers.
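
The coverage figures above are counts of unique annotated tokens that appear as whole entries in a given vocabulary, divided by the total number of unique annotated tokens. A minimal Python sketch of that calculation follows; the function name, the toy tokens, and the use of case-sensitive exact string matching are assumptions for illustration, not details taken from the study.

```python
from typing import Iterable, Tuple


def vocabulary_coverage(annotated_tokens: Iterable[str],
                        vocabulary: Iterable[str]) -> Tuple[int, int, float]:
    """Return (covered count, unique token count, coverage percentage)."""
    unique_tokens = set(annotated_tokens)
    covered = unique_tokens & set(vocabulary)
    percentage = 100 * len(covered) / len(unique_tokens) if unique_tokens else 0.0
    return len(covered), len(unique_tokens), percentage


# Toy data (hypothetical): four unique annotated tokens, two of them in the vocabulary.
tokens = ["her2", "er", "pr", "her2", "positive"]
vocab = {"positive", "er", "the"}
covered, total, pct = vocabulary_coverage(tokens, vocab)

# Reproducing one reported cell: 14 of 33 hormone receptor type tokens.
reported_pct = round(100 * 14 / 33, 1)
```

Applied to the first row of the table, 14 covered tokens out of 33 give the reported 42.4%.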
Figure 3. Examples of token clusters in the visualization of word embeddings obtained from CancerBERTCustVoc_397 (a and b) and CancerBERTCustVoc_997 (c and d) models using t-SNE. BERT: bidirectional encoder representations from transformers; t-SNE: t-distributed stochastic neighbor embedding.