Hong-Jie Dai
Abstract
BACKGROUND: Family history information (FHI) described in unstructured electronic health records (EHRs) is a valuable information source for patient care and scientific research. Because FHI is usually written as free text, FHI extraction consists of several steps, including section segmentation, family member and clinical observation extraction, and relation discovery between the extracted members and their observations. The extraction step involves recognizing FHI concepts along with their properties, such as the family side attribute of the family member concept.
Keywords: Family history information extraction; Named entity recognition; Neural sequence labeling modeling
Year: 2019 PMID: 31881965 PMCID: PMC6933890 DOI: 10.1186/s12911-019-0996-4
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 An example of the annotation of the family history information extraction task
The Normalized Family Names
| Degree | Normalized Family Names |
|---|---|
| 1 | Father, Mother, Parent, Sister, Brother, Daughter, Son, Child |
| 2 | Grandmother, Grandfather, Grandparent, Cousin, Sibling, Aunt, Uncle |
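The table above can be read as a lookup from a normalized family name to its degree of relatedness. The following is a minimal sketch of that lookup, not the study's implementation; the names `DEGREE_BY_NAME` and `degree_of` are hypothetical, and the case-insensitive matching is an assumption added for convenience.

```python
# Normalized family names, transcribed from the table above.
FIRST_DEGREE = ("Father", "Mother", "Parent", "Sister",
                "Brother", "Daughter", "Son", "Child")
SECOND_DEGREE = ("Grandmother", "Grandfather", "Grandparent",
                 "Cousin", "Sibling", "Aunt", "Uncle")

# Map each normalized name to its degree (1 or 2).
DEGREE_BY_NAME = {name: 1 for name in FIRST_DEGREE}
DEGREE_BY_NAME.update({name: 2 for name in SECOND_DEGREE})


def degree_of(name):
    """Return the degree for a normalized family name, or None if the
    name is not listed in the table.

    ASSUMPTION: matching ignores case and surrounding whitespace.
    """
    return DEGREE_BY_NAME.get(name.strip().capitalize())


if __name__ == "__main__":
    for candidate in ("Mother", "aunt", "Husband"):
        print(candidate, degree_of(candidate))
```

Names outside the table, such as "Husband", return `None`, so a caller can decide separately how to treat mentions that have no normalized form.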
Fig. 2 The neural sequence labeling model employed in the study for the family history information extraction task. Input containing numerical characters, such as “0-yr”, was normalized to “0”
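The numeric normalization mentioned in the caption of Fig. 2 can be illustrated with a short sketch. The caption gives only one example (“0-yr” becomes “0”), so the rule used here, replacing any token that contains a digit with "0", is an assumption; the function names are hypothetical as well.

```python
import re

# Matches any token that contains at least one digit.
_HAS_DIGIT = re.compile(r"\d")


def normalize_token(token):
    """Return "0" for a token containing a digit, else the token unchanged.

    ASSUMPTION: the whole token is collapsed to "0"; the source only
    shows the single example "0-yr" -> "0".
    """
    return "0" if _HAS_DIGIT.search(token) else token


def normalize_tokens(tokens):
    """Apply normalize_token to every token of a tokenized sentence."""
    return [normalize_token(t) for t in tokens]


if __name__ == "__main__":
    print(normalize_tokens(["a", "35-year-old", "sister"]))
```

Collapsing numeric tokens this way reduces the number of distinct input tokens, so ages and counts share a single embedding rather than many rare ones.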
Fig. 3 The distribution of the family members and the corresponding family side attributes in the training set of the FHIE corpus
Semantic Types Considered in This Study
| Semantic Types (Abbreviationᵃ) |
|---|
| aapp, acab, aggp, anab, bacs, bdsu, bdsy, bird, blor, bpoc, bsoj, cell, cgab, clna, cnce, comd, drdd, dsyn, elii, emod, euka, famg, fndg, fngs, ftcn, genf, gngm, hlca, hops, idcn, inbe, inch, inpo, inpr, irda, lang, mamm, menp, mnob, mobd, neop, npop, orch, orga, orgf, patf, phsf, phsu, plnt, podg, popg, qlco, qnco, sosy, spco, tisu, tmco, topp, virs, vita |

ᵃ The full names of the semantic types can be found at https://mmtx.nlm.nih.gov/MMTx/semanticTypes.shtml.
Hyper-parameters of the Developed Neural Sequence Labeling Network
| Parameter | Value | Parameter | Value |
|---|---|---|---|
| word embedding size | 200 | Learning rate (LR) | 0.01 |
| char embedding size | 30 | Batch size | 10 |
| char embedding kernel size | 3 | Optimizer | SGD |
| number of char embedding kernels | 50 | Dropout | 0.5 |
| PoS embedding size | 20 | LR decay | 0.05 |
| UMLS embedding size | 200 | L2 regularization | 1e-8 |
| Epoch | 1000 | | |
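For reference, the hyper-parameters in the table can be collected into one configuration object. This is a sketch only: the class and field names are hypothetical. The table does not say how the LR decay value is applied, so the inverse-time schedule in `decayed_lr` is an assumption and not the study's confirmed schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperParams:
    """Values transcribed from the hyper-parameter table; field names are hypothetical."""
    word_emb_size: int = 200
    char_emb_size: int = 30
    char_kernel_size: int = 3
    num_char_kernels: int = 50
    pos_emb_size: int = 20
    umls_emb_size: int = 200
    epochs: int = 1000
    learning_rate: float = 0.01
    batch_size: int = 10
    optimizer: str = "SGD"
    dropout: float = 0.5
    lr_decay: float = 0.05
    l2_regularization: float = 1e-8


def decayed_lr(hp, epoch):
    """Learning rate at a given epoch (0-based).

    ASSUMPTION: inverse-time decay, lr / (1 + decay * epoch).
    The table lists only the decay value, not the schedule.
    """
    return hp.learning_rate / (1.0 + hp.lr_decay * epoch)


if __name__ == "__main__":
    hp = HyperParams()
    for epoch in (0, 1, 20):
        print(epoch, decayed_lr(hp, epoch))
```

Under this assumed schedule, the learning rate halves after 20 epochs, because 1 + 0.05 × 20 = 2.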
Performance Comparison with CRF-based Methods on the Training Set with 10-fold Cross Validation
| Configuration | Precision | Recall | F1-score |
|---|---|---|---|
| Baseline | 0.882 | 0.857 | 0.870ᵃ |
| CRF-Baseline | 0.836 | 0.743 | 0.787 |
| Side | 0.902 | 0.855 | 0.878ᵃ |
| CRF-Side | 0.865 | 0.753 | 0.805 |
| Relation-side | 0.883 | 0.854 | 0.869ᵃ |
| CRF-Relation-side | 0.850 | 0.700 | 0.768 |

ᵃ Indicates passing the significance test at the 0.001 level. The p-values for the three configurations are 0.000006, 0.00005, and 0.000000004, respectively.
Fig. 4 The official test set results for the family information
The Performance of the Top-ranked Systems in the Family History Information Extraction Task
| Team | Precision | Recall | F1-score |
|---|---|---|---|
| X Shi, D Jiang, Y Huang, X Wang, Q Chen, J Yan and B Tang | 0.8886 | 0.8837 | 0.8861 |
| Anshik, V Gela and S Madgi | 0.8819 | 0.7964 | 0.837 |
| D Kim, S-Y Shin, H-W Lim and S Kim | 0.7932 | 0.8393 | 0.8156 |
| Our System | 0.8285 | 0.8698 | 0.8486 |
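The F1-scores in the table are consistent with the standard harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes them from the reported precision and recall values; the function name `f1_score` and the short row labels are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs transcribed from the table above.
SYSTEMS = {
    "Shi et al.": (0.8886, 0.8837),
    "Anshik et al.": (0.8819, 0.7964),
    "Kim et al.": (0.7932, 0.8393),
    "Our System": (0.8285, 0.8698),
}

if __name__ == "__main__":
    for name, (p, r) in SYSTEMS.items():
        print(f"{name}: F1 = {f1_score(p, r):.4f}")
```

Rounded to four decimals, the recomputed values match the F1-score column for all four systems.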
Summary of the Methods and Resources Used by All Participating Teams in the Family History Information Extraction Task
| Type | Description |
|---|---|
| Methodology | CRF, Bidirectional LSTM-CRF, Bidirectional CNN-LSTM-CRF, Pattern |
| Word embedding | GloVe, word2vec |
| Part-of-speech | NLTK (Natural language toolkit), MedPost |
| Ontology/Lexicon | MeSH, Mayo Clinic website, and UMLS embedding |
Fig. 5 Performance comparison by using different word embeddings
Fig. 6 Performance comparison without the UMLS (embedding) features
Fig. 7 The performance comparison of the three submitted runs without considering the family side attributes
The challenging cases in the test set for the entity recognition subtask of the family history information extraction task. The family mentions in italic and bold face were false positive cases.
| No. | Example |
|---|---|
| 1 | Leah’s father’s brother[uncle/paternal], a 35-year-old gentleman, is considered by .. Leah’s father’s[grandfather/paternal] 33-year-old sister[aunt/paternal] is described as dysmorphic with dysmorphic and ... Leah’s father’s mother[grandmother/paternal] developed unilateral renal artery stenosis … That lady’s sister[aunt/paternal] is reported to have coronary artery disease … |
| 2 | Suzanne has a maternal aunt who died at age 55 of a liver cancer, and this aunt has |
| 3 | One of |
| 4 | Ms. Natividad’s father is healthy at the age of 80. He had one sibling, |
| 5 | Mrs. Manuela reports a maternal aunt had |
| 6 | The father died at age 89 with hydrocephalus. In his |
| 7 | Her mother did have a total of five healthy |
| 8 | The patient’s next sister was diagnosed with schizophrenia at the age of 43. … She has |
| 9 | Suzanne’s husband is 20 and has autism. His |
| 10 | Hannelore has a healthy 38-year-old sister who is a carrier for urethral cancer and has a healthy 7-month-old |
| 11 | The father’s |
Fig. 8 Performance of the submitted runs without adding the CRF inference layer