Jianlin Shi, Keaton L Morgan, Richard L Bradshaw, Se-Hee Jung, Wendy Kohlmann, Kimberly A Kaphingst, Kensaku Kawamoto, Guilherme Del Fiol.
Abstract
BACKGROUND: Family health history has been recognized as an essential factor for cancer risk assessment and is an integral part of many cancer screening guidelines, including genetic testing for personalized clinical management strategies. However, manually identifying eligible candidates for genetic testing is labor intensive.
Keywords: clinical natural language processing; cohort identification; family health history extraction; genetic testing of hereditary cancers
Year: 2022; PMID: 35969459; PMCID: PMC9412758; DOI: 10.2196/37842
Source DB: PubMed; Journal: JMIR Med Inform
Figure 1. Study stages, including natural language processing (NLP) development (stage 1) and comparison between the NLP-augmented algorithm and an algorithm using only structured data (stage 2). EDW: enterprise data warehouse; FHH: family health history.
Figure 2. Data set creation process. FHH: family health history. NCCN: National Comprehensive Cancer Network. NLP: natural language processing. *HNPCC: hereditary non-polyposis colorectal cancer. FAP: familial adenomatous polyposis. Other genetic mutations or cancer syndromes specified in the NCCN guideline but without a code in the electronic health record (EHR) were not included.
An example of combining structured and unstructured data from FHH [a] assertions.
| Field names | Condition | Comments [b] | Family member | Age of onset |
| Original data | CANCER | Breast, great-aunt, dx at age of 52 | AUNT | NULL |
| Combined | {{CANCER}} | Breast, great-aunt, dx at age of 52 | {{AUNT}} | {{}} |
| Annotations | | | | |
[a] FHH: family health history.
[b] In this case, the comments field supplements or corrects the structured data, that is, CANCER is of the breast, and the family member (AUNT) is actually the patient’s great-aunt. FX_CANCER (FC): family member to cancer relationship; FX_ONSET: family member to age of onset relationship.
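The idea that the comments field can supplement or correct the structured fields can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `FhhEntry` class, the `combine` function, the small vocabularies, and the regular expression are all hypothetical, chosen only so that the example row from the table can be processed.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FhhEntry:
    """One family health history assertion (hypothetical structure)."""
    condition: str
    family_member: str
    age_of_onset: Optional[int]
    comments: str


# Hypothetical, deliberately tiny vocabularies for illustration only.
CANCER_SITES = ("breast", "colon", "ovarian", "prostate")
SPECIFIC_RELATIVES = ("great-aunt", "great-uncle", "grandmother", "grandfather")
# Assumed pattern for phrases such as "dx at age of 52".
AGE_PATTERN = re.compile(
    r"\b(?:dx|diagnosed)\s+at\s+(?:the\s+)?age\s+(?:of\s+)?(\d{1,3})\b",
    re.IGNORECASE,
)


def combine(entry: FhhEntry) -> FhhEntry:
    """Let the free-text comments refine or fill the structured fields."""
    text = entry.comments.lower()

    # A generic CANCER code is refined when the comments name a site.
    condition = entry.condition
    if condition == "CANCER":
        for site in CANCER_SITES:
            if re.search(rf"\b{site}\b", text):
                condition = f"{site.upper()} CANCER"
                break

    # A more specific relative in the comments overrides the coded one.
    member = entry.family_member
    for relative in SPECIFIC_RELATIVES:
        if re.search(rf"\b{re.escape(relative)}\b", text):
            member = relative.upper()
            break

    # A missing age of onset is filled from the comments when stated.
    age = entry.age_of_onset
    if age is None:
        match = AGE_PATTERN.search(text)
        if match:
            age = int(match.group(1))

    return FhhEntry(condition, member, age, entry.comments)


example = FhhEntry("CANCER", "AUNT", None, "Breast, great-aunt, dx at age of 52")
print(combine(example))
```

A rule-based system in practice would need far larger vocabularies and context handling (negation, uncertainty, who the statement refers to); the sketch only shows the direction of the correction, from comments to structured fields.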
Figure 3. Screenshot of the schema as implemented with the annotation tool Brat.
Figure 4. Easy clinical information extractor processing workflow. Three major steps (blue boxes): (1) entity extraction: extract the entities from the family health history entries; (2) entity reconciliation: reconcile the conflicts between the extracted entities; (3) relation identification: link related entities. Each step uses one or more natural language processing components to complete its processing substeps.
Heuristic rules to reconcile entities.
| Structured fields | Example | Comments field | Example | Reconciliation |
| | {{CANCER, COLON}} | Colorectal cancer–related | Lynch syndrome | Chose |
| | {{50}} | | The late 50s | Chose |
| | {{50}} | | In 1985 | Chose |
| | {{50}} | | 10 years ago | Chose |
| NULL | {{}} | | Deceased at age 60 years | Inferred the |
| | {{AUNT}} | A specific | Great-aunt | Chose |
| | {{MOTHER}} | A specific | And grandmother | Use |
| | {{AUNT}} | Nonspecific | Father side | Chose |
| NULL | {{}} | Multiple | 2× sisters | Created two |
[a] Words in italics denote concepts in the NLP output according to the FHH annotation schema.
Patient characteristics in the NLP [a] development or evaluation data set and the NCCN [b] algorithm evaluation data set.
| Characteristic | NLP development or evaluation data set (n=2398) | NCCN algorithm evaluation data set (n=66,853) |
| Gender (male), n (%) | 998 (41.2) | 24,524 (36.7) |
| Race, n (%) | | |
| White | 1752 (73.2) | 51,171 (76.5) |
| Other | 359 (15) | 9510 (14.2) |
| Asian | 141 (5.9) | 2973 (4.4) |
| Black or African American | 67 (2.8) | 1450 (2.2) |
| Not reported | 56 (2.3) | 1226 (1.8) |
| American Indian or Alaska Native | 17 (0.7) | 523 (0.8) |
| Hispanic ethnicity, n (%) | 327 (13.6) | 9147 (13.7) |
| Age (years), mean (SD) | 40.2 (9.6) | 42.6 (9.9) |
[a] NLP: natural language processing.
[b] NCCN: National Comprehensive Cancer Network.
The performance on the snippet-level data set.
| Relation types | TP [a] | FP [b] | FN [c] | Precision | Recall | F1 score |
| FX_CANCER [d] | 489 | 32 | 31 | 0.94 | 0.94 | 0.94 |
| FX_SYNDROME [e] | 2 | 1 | 3 | 0.67 | 0.40 | 0.50 |
| FX_GENE_MUT [f] | 2 | 0 | 0 | 1.00 | 1.00 | 1.00 |
| FX_ONSET [g] | 203 | 10 | 14 | 0.95 | 0.94 | 0.94 |
| Microaverage [h] | N/A [i] | N/A | N/A | 0.94 (0.91-0.97) | 0.94 (0.90-0.96) | 0.94 (0.91-0.96) |
[a] TP: true positive.
[b] FP: false positive.
[c] FN: false negative.
[d] FX_CANCER: family member to cancer relation.
[e] FX_SYNDROME: family member to cancer syndrome relation.
[f] FX_GENE_MUT: family member to cancer-related gene-mutation relation.
[g] FX_ONSET: family member to age of onset relationship.
[h] These scores were computed using aggregated data, including all 4 relation types. The CIs were computed using the bootstrap method.
[i] N/A: not applicable.
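As a worked check of the table, the per-relation and microaveraged scores can be recomputed from the reported TP, FP, and FN counts. The Python sketch below uses the standard definitions of precision, recall, and F1; the names `Counts`, `prf`, and `micro_average` are illustrative and not taken from the paper. The bootstrap CIs are not reproduced because they require instance-level data that the table does not provide.

```python
from typing import Dict, NamedTuple, Tuple


class Counts(NamedTuple):
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def prf(c: Counts) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one set of counts."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def micro_average(counts: Dict[str, Counts]) -> Tuple[float, float, float]:
    """Pool the counts over all relation types, then score once."""
    pooled = Counts(
        tp=sum(c.tp for c in counts.values()),
        fp=sum(c.fp for c in counts.values()),
        fn=sum(c.fn for c in counts.values()),
    )
    return prf(pooled)


# Counts copied from the snippet-level performance table.
SNIPPET_COUNTS = {
    "FX_CANCER": Counts(489, 32, 31),
    "FX_SYNDROME": Counts(2, 1, 3),
    "FX_GENE_MUT": Counts(2, 0, 0),
    "FX_ONSET": Counts(203, 10, 14),
}

for name, c in SNIPPET_COUNTS.items():
    p, r, f = prf(c)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

p, r, f = micro_average(SNIPPET_COUNTS)
print(f"Microaverage: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Pooling gives 696 true positives, 43 false positives, and 48 false negatives, which agrees with the totals in the error table and rounds to 0.94 for all three microaveraged scores.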
Types of snippet-level errors and counts.
| Type of errors | False positive, n | False negative, n | Examples |
| Annotation error | 10 | 13 | A missed annotation |
| Data input typo [a] | 1 | 5 | bladdler ca [b] |
| Out of vocabulary [a] | 2 | 6 | Precancer |
| Context error [a] | 22 | 11 | Possible, colon cancer, died when pt was |
| Ambiguous input | 2 | 3 | {{CANCER, COLON}} ileum {{FATHER}} |
| Schema mismatch [c] | 6 | 10 | See above |
| Total | 43 | 48 | N/A [d] |
[a] These 3 types of errors are natural language processing (NLP)–caused errors or can be fixed by improving the NLP.
[b] ca: cancer.
[c] This type of error does not need to be fixed.
[d] N/A: not applicable.