Jitendra Jonnagaddala, Siaw-Teng Liaw, Pradeep Ray, Manish Kumar, Hong-Jie Dai, Chien-Yeh Hsu.
Abstract
Heart disease is the leading cause of death worldwide. Therefore, assessing the risk of its occurrence is a crucial step in predicting serious cardiac events. Identifying heart disease risk factors and tracking their progression is a preliminary step in heart disease risk assessment. A large number of studies have reported the use of risk factor data collected prospectively. Electronic health record systems are a great resource of the required risk factor data. Unfortunately, most of the valuable information on risk factor data is buried in the form of unstructured clinical notes in electronic health records. In this study, we present an information extraction system to extract related information on heart disease risk factors from unstructured clinical notes using a hybrid approach. The hybrid approach employs both machine learning and rule-based clinical text mining techniques. The developed system achieved an overall microaveraged F-score of 0.8302.
Year: 2015 PMID: 26380290 PMCID: PMC4561944 DOI: 10.1155/2015/636371
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Overview of risk factors, indicator attributes, and time attributes (DCT: document creation time).
| Risk factor | Indicator attribute | Time attribute |
|---|---|---|
| CAD | Mention, event, test result, and symptom | Before DCT, during DCT, and after DCT |
| Diabetes | Mention, high A1c, and high glucose | Before DCT, during DCT, and after DCT |
| Family history | Present, not present | Not applicable |
| Hyperlipidemia | Mention, high cholesterol, and high LDL | Before DCT, during DCT, and after DCT |
| Hypertension | Mention, high blood pressure | Before DCT, during DCT, and after DCT |
| Medication | ACE inhibitors, ACE inhibitors ARBs, amylin, antidiabetes medications, aspirin, beta-blockers, calcium-channel blockers, DPP-4 inhibitors, ezetimibe, fibrates, GLP-1 agonists, insulin, meglitinides, metformin, niacin, nitrates, obesity, statins, sulfonylureas, thiazide diuretics, thiazolidinediones, and thienopyridines | Before DCT, during DCT, and after DCT |
| Obesity | Mention, BMI, and waist circumference | Before DCT, during DCT, and after DCT |
| Smoking history | Current, past, ever, never, and unknown | Not applicable |
Figure 1. Sample EHR with annotations of heart disease risk factors.
Figure 2. Overview of heart disease risk factor information extraction system.
Features used by smoking history, sectionizer, and time attribute assigner classifiers.
| Component | Classification | Classifier | Classes | List of features |
|---|---|---|---|---|
| Smoking history | Sentence level | Naïve Bayes | Current, past, and never | Bag of words, POS tags |
| Sectionizer | Sentence level | Conditional random fields | Section heading, section heading with text, and text | First word uppercased, all words uppercased, all words lowercased, dictionary match, first word, second word, previous sentence features, next sentence features, full stop, and containing colon |
| Time attribute assigner | Phrase level | Naïve Bayes | Before DCT, during DCT, after DCT, and continuing | Identified risk factor spans, previous word, previous word POS tag, next word, next word POS tag, section information, and indicator attribute |
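The sectionizer features listed above are simple surface cues, so a short sketch may help make them concrete. The following is an illustration only, not the authors' implementation: the heading dictionary `SECTION_HEADINGS`, the function names, and the reading of "full stop" as "sentence ends with a period" are all assumptions, and the CRF itself is not shown.

```python
# Hypothetical heading dictionary; the paper's custom dictionary is not given here.
SECTION_HEADINGS = {"medications", "family history", "social history", "allergies"}


def sentence_features(sentence):
    """Surface features for one sentence, mirroring the sectionizer feature list."""
    words = sentence.split()
    # Only tokens containing letters can meaningfully be upper- or lowercased.
    alpha = [w for w in words if any(c.isalpha() for c in w)]
    lowered = sentence.lower()
    return {
        "first_word_upper": bool(words) and words[0].isupper(),
        "all_words_upper": bool(alpha) and all(w.isupper() for w in alpha),
        "all_words_lower": bool(alpha) and all(w.islower() for w in alpha),
        "dictionary_match": any(h in lowered for h in SECTION_HEADINGS),
        "first_word": words[0].lower() if words else "",
        "second_word": words[1].lower() if len(words) > 1 else "",
        # Assumption: "full stop" is read as "ends with a period".
        "full_stop": sentence.rstrip().endswith("."),
        "contains_colon": ":" in sentence,
    }


def with_context(sentences):
    """Add previous and next sentence features to each sentence's feature dict."""
    rows = [sentence_features(s) for s in sentences]
    merged_rows = []
    for i, row in enumerate(rows):
        merged = dict(row)
        if i > 0:
            merged.update({"prev_" + k: v for k, v in rows[i - 1].items()})
        if i + 1 < len(rows):
            merged.update({"next_" + k: v for k, v in rows[i + 1].items()})
        merged_rows.append(merged)
    return merged_rows
```

Each dictionary returned by `with_context` could serve as one item in a sequence passed to a CRF toolkit, with the sentence labels (section heading, section heading with text, text) as the targets.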
Performance on test set by indicator attributes.
| Risk factor | Macroaveraged precision | Macroaveraged recall | Macroaveraged F-score | Microaveraged precision | Microaveraged recall | Microaveraged F-score |
|---|---|---|---|---|---|---|
| CAD | ||||||
| Mention | 0.3346 | 0.3405 | 0.3375 | 0.5029 | 1.0000 | 0.6693 |
| Event | 0.1148 | 0.1138 | 0.1143 | 0.8806 | 0.4245 | 0.5728 |
| Test result | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Symptom | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Diabetes | ||||||
| Mention | 0.6887 | 0.6907 | 0.6897 | 0.9219 | 0.9972 | 0.9581 |
| High A1c | 0.1109 | 0.106 | 0.1084 | 0.8906 | 0.6951 | 0.7808 |
| High glucose | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Family history | ||||||
| Present | 0.0097 | 0.0097 | 0.0097 | 1.0000 | 0.2632 | 0.4167 |
| Not present | 0.9630 | 0.9630 | 0.9630 | 0.9725 | 1.0000 | 0.9861 |
| Hyperlipidemia | ||||||
| Mention | 0.4436 | 0.4436 | 0.4436 | 0.8444 | 0.962 | 0.8994 |
| High cholesterol | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| High LDL | 0.0331 | 0.0331 | 0.0331 | 0.7391 | 0.5862 | 0.6538 |
| Hypertension | ||||||
| Mention | 0.7062 | 0.7101 | 0.7082 | 0.9553 | 0.9918 | 0.9732 |
| High blood pressure | 0.2996 | 0.2889 | 0.2942 | 0.4858 | 0.7897 | 0.6016 |
| Medication | ||||||
| ACE inhibitors | 0.3320 | 0.3482 | 0.3399 | 0.8797 | 0.8325 | 0.8555 |
| ACE inhibitors ARBs | 0.1096 | 0.1128 | 0.1112 | 0.8667 | 0.8756 | 0.8711 |
| Amylin | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Antidiabetes Medications | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Aspirin | 0.3709 | 0.3930 | 0.3817 | 0.9079 | 0.7168 | 0.8011 |
| Beta-blockers | 0.3891 | 0.4047 | 0.3967 | 0.9302 | 0.7186 | 0.8108 |
| Calcium-channel blockers | 0.2010 | 0.2160 | 0.2082 | 0.9064 | 0.8052 | 0.8528 |
| DPP-4 inhibitors | 0.0039 | 0.0039 | 0.0039 | 1.0000 | 1.0000 | 1.0000 |
| Ezetimibe | 0.0214 | 0.0253 | 0.0232 | 0.6471 | 0.9167 | 0.7586 |
| Fibrates | 0.0506 | 0.05447 | 0.0525 | 0.8966 | 0.8667 | 0.8814 |
| GLP-1 agonists | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Insulin | 0.1790 | 0.1887 | 0.1837 | 0.8598 | 0.6987 | 0.7709 |
| Meglitinides | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Metformin | 0.2069 | 0.2228 | 0.2145 | 0.8439 | 0.8598 | 0.8518 |
| Niacin | 0.0123 | 0.0175 | 0.0144 | 0.4524 | 0.7600 | 0.5672 |
| Nitrates | 0.1031 | 0.1148 | 0.1086 | 0.803 | 0.5867 | 0.6780 |
| Obesity | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Statins | 0.4617 | 0.4786 | 0.4700 | 0.9199 | 0.8715 | 0.8950 |
| Sulfonylureas | 0.1518 | 0.1595 | 0.1555 | 0.9286 | 0.8125 | 0.8667 |
| Thiazide Diuretics | 0.1226 | 0.1376 | 0.1297 | 0.3058 | 0.7441 | 0.4335 |
| Thiazolidinediones | 0.0396 | 0.04475 | 0.0420 | 0.8841 | 1.0000 | 0.9385 |
| Thienopyridines | 0.1543 | 0.1673 | 0.1606 | 0.8914 | 0.8380 | 0.8639 |
| Obesity | ||||||
| Mention | 0.1589 | 0.1693 | 0.1639 | 0.7632 | 1.0000 | 0.8657 |
| BMI | 0.0136 | 0.0123 | 0.0129 | 1.0000 | 0.4118 | 0.5833 |
| Waist circumference | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Smoking history | ||||||
| Current | 0.0234 | 0.0234 | 0.0234 | 0.8000 | 0.3636 | 0.5000 |
| Past | 0.1479 | 0.1479 | 0.1479 | 0.8636 | 0.6726 | 0.7562 |
| Ever | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| Never | 0.1576 | 0.1576 | 0.1576 | 0.7864 | 0.6750 | 0.7265 |
| Unknown | 0.4728 | 0.4728 | 0.4728 | 0.7890 | 1.0000 | 0.8820 |
Performance on test set by time attributes.
| Risk factor | Macroaveraged precision | Macroaveraged recall | Macroaveraged F-score | Microaveraged precision | Microaveraged recall | Microaveraged F-score |
|---|---|---|---|---|---|---|
| CAD | ||||||
| Before DCT | 0.3434 | 0.2628 | 0.2977 | 0.5599 | 0.5827 | 0.5711 |
| During DCT | 0.3405 | 0.3176 | 0.3286 | 0.5117 | 0.8102 | 0.6272 |
| After DCT | 0.3327 | 0.3288 | 0.3307 | 0.5000 | 0.9771 | 0.6615 |
| Diabetes | ||||||
| Before DCT | 0.6848 | 0.6683 | 0.6765 | 0.9152 | 0.9255 | 0.9203 |
| During DCT | 0.6907 | 0.6699 | 0.6801 | 0.9245 | 0.9293 | 0.9269 |
| After DCT | 0.6887 | 0.6887 | 0.6887 | 0.9219 | 0.9972 | 0.9581 |
| Hyperlipidemia | ||||||
| Before DCT | 0.4504 | 0.4429 | 0.4466 | 0.8419 | 0.9007 | 0.8703 |
| During DCT | 0.4426 | 0.4407 | 0.4416 | 0.8382 | 0.9421 | 0.8872 |
| After DCT | 0.4436 | 0.4436 | 0.4436 | 0.8444 | 0.962 | 0.8994 |
| Hypertension | ||||||
| Before DCT | 0.7043 | 0.6868 | 0.6954 | 0.9526 | 0.9403 | 0.9464 |
| During DCT | 0.644 | 0.7364 | 0.6871 | 0.7432 | 0.9557 | 0.8362 |
| After DCT | 0.7062 | 0.7062 | 0.7062 | 0.9553 | 0.9918 | 0.9732 |
| Medication | ||||||
| Before DCT | 0.6768 | 0.6600 | 0.6683 | 0.8332 | 0.7923 | 0.8122 |
| During DCT | 0.6613 | 0.6519 | 0.6565 | 0.8095 | 0.7858 | 0.7975 |
| After DCT | 0.6729 | 0.6648 | 0.6688 | 0.8200 | 0.7943 | 0.8069 |
| Obesity | ||||||
| Before DCT | 0.1537 | 0.1518 | 0.1527 | 0.7383 | 0.9753 | 0.8404 |
| During DCT | 0.1693 | 0.1634 | 0.1663 | 0.8246 | 0.9400 | 0.8785 |
| After DCT | 0.1537 | 0.1518 | 0.1527 | 0.7383 | 0.9753 | 0.8404 |
| All risk factors | ||||||
| Before DCT | 0.7727 | 0.7881 | 0.7803 | 0.8224 | 0.8146 | 0.8185 |
| During DCT | 0.7463 | 0.8187 | 0.7808 | 0.7835 | 0.8470 | 0.8140 |
| After DCT | 0.7706 | 0.8321 | 0.8002 | 0.8136 | 0.8688 | 0.8403 |
Performance of baseline system on test set.
| Risk factor | Macroaveraged precision | Macroaveraged recall | Macroaveraged F-score | Microaveraged precision | Microaveraged recall | Microaveraged F-score |
|---|---|---|---|---|---|---|
| CAD | 0.2135 | 0.2311 | 0.2220 | 0.6652 | 0.5599 | 0.6080 |
| Diabetes | 0.6576 | 0.6745 | 0.6660 | 0.8692 | 0.9517 | 0.9086 |
| Family history | 0.9689 | 0.9689 | 0.9689 | 0.9689 | 0.9689 | 0.9689 |
| Hyperlipidemia | 0.4465 | 0.4412 | 0.4439 | 0.8434 | 0.9254 | 0.8825 |
| Hypertension | 0.3429 | 0.4833 | 0.4012 | 0.5579 | 0.6148 | 0.5850 |
| Medication | 0.5486 | 0.6534 | 0.5964 | 0.6227 | 0.7409 | 0.6767 |
| Obesity | 0.1402 | 0.1419 | 0.141 | 0.8447 | 0.8511 | 0.8479 |
| Smoking history | 0.6284 | 0.6284 | 0.6284 | 0.6284 | 0.6309 | 0.6296 |
| Overall | 0.6954 | 0.7634 | 0.7278 | 0.6779 | 0.7566 | 0.7151 |
Performance of HDRFSystem on test set.
| Risk factor | Macroaveraged precision | Macroaveraged recall | Macroaveraged F-score | Microaveraged precision | Microaveraged recall | Microaveraged F-score |
|---|---|---|---|---|---|---|
| CAD | 0.3455 | 0.2985 | 0.3203 | 0.5261 | 0.7334 | 0.6127 |
| Diabetes | 0.6876 | 0.6724 | 0.6799 | 0.9202 | 0.9483 | 0.9341 |
| Family history | 0.9728 | 0.9728 | 0.9728 | 0.9728 | 0.9728 | 0.9728 |
| Hyperlipidemia | 0.4504 | 0.4451 | 0.4477 | 0.8415 | 0.9334 | 0.8851 |
| Hypertension | 0.6970 | 0.7375 | 0.7166 | 0.8531 | 0.9613 | 0.9040 |
| Medication | 0.6703 | 0.6731 | 0.6717 | 0.8209 | 0.7908 | 0.8056 |
| Obesity | 0.1589 | 0.1652 | 0.1620 | 0.7683 | 0.9618 | 0.8542 |
| Smoking history | 0.8113 | 0.8113 | 0.8113 | 0.8113 | 0.8145 | 0.8129 |
| Overall | 0.8053 | 0.8515 | | 0.8138 | 0.8472 | 0.8302 |
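The reported scores are consistent with the standard F-score, the harmonic mean of precision and recall. The sketch below shows that calculation together with generic micro- and macroaveraging over groups of counts. It is an illustration, not the evaluation script used in the study: the function names are hypothetical, and the unit that the study macroaverages over (for example, per document or per class) is not stated here, so the code simply averages over whatever groups it is given.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def micro_average(counts):
    """Pool the counts of all groups first, then compute precision and recall once."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return precision_recall(tp, fp, fn)


def macro_average(counts):
    """Compute precision and recall per group, then take the unweighted mean."""
    pairs = [precision_recall(*c) for c in counts]
    if not pairs:
        return 0.0, 0.0
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n
```

For instance, `round(f_score(0.8138, 0.8472), 4)` reproduces the overall microaveraged F-score of 0.8302 quoted in the abstract. Because macroaveraging weights every group equally, groups with few or no correct predictions pull the macroaveraged figures down much more than the microaveraged ones.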
Examples of rules used in HDRFSystem components.
| Component | Number of rules | Examples |
|---|---|---|
| Medication recognition | 12 | If the medication identified by MetaMap is from RxNorm terminology, assign risk factor with identified medication name. |
| | | If the medications identified by MetaMap include abbreviations from custom abbreviations dictionary, assign medication risk factor with full medication name. |
| Disease disorder recognition | 22 | If the disease identified by MetaMap is from SNOMED CT terminology and is either CAD or obesity or diabetes or hypertension or hyperlipidemia, assign risk factor with identified disease name. |
| | | If annotated text is identified by blood pressure lab value extractor and diastolic >90 or systolic >140, assign risk factor = “hypertension.” |
| Family history | 5 | If a sentence contains “cad” or “coronary artery disease” and contains “father,” “mother,” or “brother,” assign sentence as family history sentence. |
| | | If family history sentence contains age of death and age <45, assign family history = “present” or else “unknown.” |
| Smoking history | 7 | If a sentence contains terms from custom smoking terms dictionary, assign sentence as smoking history sentence. |
| | | If document does not contain smoking terms, assign smoking history = “unknown.” |
| Sectionizer | 4 | If a sentence is classified as “text” but contains terms from custom section headings dictionary, assign label “section heading.” |
| | | If a sentence is classified as “section heading with text” and contains “:”, extract text before “:” to obtain section information. |
| Indicator attribute assigner | 26 | If annotated text is identified by MetaMap, assign attribute = “mention.” |
| | | If annotated text is identified by blood pressure lab value extractor and diastolic >90 or systolic >140, assign indicator attribute = “high BP.” |
| Time attribute assigner | 1 | If time attribute assigner assigned class is “continuing,” assign time attributes = “before DCT,” “after DCT,” and “during DCT.” |
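A few of the example rules above are simple enough to sketch directly. The code below is a minimal illustration under stated assumptions, not the HDRFSystem code: the regular expression, the function names, and the whole-word matching are hypothetical, and the blood pressure check uses the thresholds from the disease disorder recognition rule (systolic >140 or diastolic >90).

```python
import re

# Hypothetical pattern for readings such as "150/95"; the paper's blood pressure
# lab value extractor is not described here. Note that it would also match other
# slash-separated numbers (for example, some dates), so real use needs more context.
BP_PATTERN = re.compile(r"\b(\d{2,3})\s*/\s*(\d{2,3})\b")

RELATIVES = ("father", "mother", "brother")


def extract_blood_pressure(text):
    """Return (systolic, diastolic) from the first reading found, or None."""
    match = BP_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_high_blood_pressure(systolic, diastolic):
    """Threshold rule: systolic above 140 or diastolic above 90."""
    return systolic > 140 or diastolic > 90


def is_family_history_sentence(sentence):
    """True if the sentence mentions CAD together with one of the listed relatives."""
    lowered = sentence.lower()
    has_cad = (
        re.search(r"\bcad\b", lowered) is not None
        or "coronary artery disease" in lowered
    )
    has_relative = any(re.search(r"\b" + r + r"\b", lowered) for r in RELATIVES)
    return has_cad and has_relative


def expand_time_attribute(label):
    """Map the classifier's "continuing" class to all three time attributes."""
    if label == "continuing":
        return ["before DCT", "during DCT", "after DCT"]
    return [label]
```

As a usage example, `extract_blood_pressure("BP 150/95 today")` yields a reading that `is_high_blood_pressure` flags, while a reading of 118/76 is not flagged. Keeping each rule as a small pure function makes it straightforward to test rules in isolation before combining them with the machine learning components.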