Robert Chen, Joyce C Ho, Jin-Mann S Lin.
Abstract
BACKGROUND: Unstructured data from clinical epidemiological studies can be valuable and easy to obtain. However, it requires further extraction and processing for data analysis. Doing this manually is labor-intensive, slow, and subject to error. In this study, we propose an automation framework for extracting and processing unstructured data.
Keywords: Automation; Co-morbidity; Data extraction; Medication; Myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS); Natural language processing; Population; Tertiary; Unstructured data
Year: 2020 PMID: 33059588 PMCID: PMC7559204 DOI: 10.1186/s12874-020-01131-7
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Fig. 1 General workflow of data processing for medication and reason entries
Fig. 2 An example of ATC codes (level 1–3) for medication acetylsalicylic acid (aspirin)
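
Since the figure itself is not reproduced here, a small sketch may help show what "level 1–3" means in practice. ATC codes are hierarchical, so each level is a prefix of the full code: one character for the anatomical main group, three for the therapeutic subgroup, and four for the pharmacological subgroup. The function name `atc_levels` is hypothetical, and `N02BA01` is one of the ATC codes assigned to acetylsalicylic acid; whether the figure uses this particular code is an assumption.

```python
def atc_levels(code: str) -> dict:
    """Split a 7-character ATC code into its level 1-3 prefixes."""
    code = code.strip().upper()
    if len(code) != 7:
        raise ValueError(f"expected a 7-character ATC code, got {code!r}")
    return {
        1: code[:1],  # level 1: anatomical main group
        2: code[:3],  # level 2: therapeutic subgroup
        3: code[:4],  # level 3: pharmacological subgroup
    }


# N02BA01 is one ATC code for acetylsalicylic acid (assumed example).
levels = atc_levels("N02BA01")
```

Grouping medications by a level 2 or level 3 prefix is one plausible way to collapse many distinct drug names into a much smaller set of medication classes.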
Fig. 3 Examples of conditions/symptoms in the unstructured text and eventual mapping
Example of reason mapping from original unstructured text to a reason category
| Original Word | Stems | Reason Classes |
|---|---|---|
| arthritis | arthr | arthritis |
| musculoskeletal | musculoskelet | joint/musculoskeletal problem |
| gastritis | gastrit | gastrointestinal disease |
| ulcerative colitis | ulc colit | gastrointestinal disease |
| asthmatic | asthm | asthma |
| diarrhea | diarrhe | autonomic symptoms |
| depression | depress | depression and related disorders |
| mood | anxy | anxiety disorders |
| diabetes | diabet | diabetes |
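
The stem-to-class mapping in the table above can be sketched as a simple lookup. This is a minimal illustration, not the study's implementation: it uses prefix matching on lowercased tokens as a stand-in for a real stemmer, and the names `STEM_TO_CLASS` and `reason_class` are hypothetical. Prefix matching cannot reproduce stems that are not literal prefixes of the word (for example `anxy`), which a suffix-rewriting stemmer would handle.

```python
import re

# Stem -> reason class pairs copied from the table above.
STEM_TO_CLASS = {
    "arthr": "arthritis",
    "musculoskelet": "joint/musculoskeletal problem",
    "gastrit": "gastrointestinal disease",
    "ulc colit": "gastrointestinal disease",
    "asthm": "asthma",
    "diarrhe": "autonomic symptoms",
    "depress": "depression and related disorders",
    "anxy": "anxiety disorders",
    "diabet": "diabetes",
}


def reason_class(text):
    """Return the first class whose stem parts prefix consecutive tokens, else None."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for stem, label in STEM_TO_CLASS.items():
        parts = stem.split()  # multi-word stems such as "ulc colit"
        for i in range(len(tokens) - len(parts) + 1):
            window = tokens[i:i + len(parts)]
            if all(tok.startswith(p) for tok, p in zip(window, parts)):
                return label
    return None  # unmapped entries would need manual review
```

For instance, `reason_class("Arthritis (r) hip")` and `reason_class("arthritic joints")` both resolve to `arthritis`, while an abbreviation such as `HTN` returns `None` and would need a separate synonym list.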
Examples of representations of medications and reasons captured in unstructured text from a public health setting
| Concept | Representations in unstructured text |
|---|---|
| Aspirin | Aspirin ASA, Aspirin Bayer, Aspirin Generic, ASA |
| Albuterol | Albuteral, Albuteral Nebulizer, Albuteral Inhaler |
| Metoprolol | Metoprold ER, Metoprolo ER |
| Hypertension | High blood pressure, HTN, b/p |
| Arthritis | Arthritis (r) hip, arthritic joints, arthritis-anti-inflammatory |
Summary of feature set sizes for medications and reasons after the data processing workflow is applied
| | Population based (# unique occurrences) | Tertiary based (# unique occurrences) |
|---|---|---|
| | 664 | 378 |
| CFS | 140 | 378 |
| ISF | 308 | N/A |
| NF | 216 | N/A |
| | 907 | 802 |
| after mapping | 59 | 65 |
| | 888 | 596 |
| after mapping | 91 | 54 |
Examples of medication names, potential alternate representations (e.g., misspellings) that are corrected by the automation framework, and the final medication class assigned via the ATC system
| Medication Name | Potential Alternate Representation | Final Medication Class |
|---|---|---|
| Fexofenadine | Fexafenedine, fexfenadine | antihistamines |
| Hydrochlorothiazide | Hydrochlorothazide, Hydrochlorthiazide | diuretics |
| Metoprolol | Metoprolol Succ ER, Metoprolol Tart | beta-adrenergic blocking agents |
| Quinapril | Quinipril | angiotensin converting enzyme inhibitors |
| Vitamin B12 | Vit B12, vitemin B12 | supplement - excluded |
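
One way to picture the correction step in the table above is approximate string matching against a list of canonical names. The sketch below uses Python's standard `difflib.get_close_matches`; it is an assumption for illustration and not necessarily the matching method used in the study. The names `MEDICATION_CLASS` and `normalize_medication` are hypothetical, and the `cutoff=0.8` similarity threshold is an arbitrary choice.

```python
import difflib

# Canonical name -> final class, copied from the table above.
MEDICATION_CLASS = {
    "fexofenadine": "antihistamines",
    "hydrochlorothiazide": "diuretics",
    "metoprolol": "beta-adrenergic blocking agents",
    "quinapril": "angiotensin converting enzyme inhibitors",
    "vitamin b12": "supplement - excluded",
}


def normalize_medication(raw, cutoff=0.8):
    """Return (canonical_name, class) for a raw entry, or None if nothing is close."""
    text = " ".join(raw.lower().split())  # lowercase and collapse whitespace
    if not text:
        return None
    # Try the full entry first, then only its first word,
    # which drops formulation suffixes such as "Succ ER".
    for candidate in (text, text.split()[0]):
        hits = difflib.get_close_matches(
            candidate, list(MEDICATION_CLASS), n=1, cutoff=cutoff
        )
        if hits:
            return hits[0], MEDICATION_CLASS[hits[0]]
    return None  # leave unmatched entries for manual review
```

Abbreviations such as `Vit B12` fall below this threshold, so a real pipeline would likely need an explicit synonym list alongside fuzzy matching.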
Fig. 4 Most common medication categories after the data processing workflow is applied. The horizontal axis represents the number of records of the medication (across all patients; a patient can have multiple records)
Fig. 5 Most common reason (co-morbidity) categories after the data processing workflow is applied. The horizontal axis represents the number of records of the reason (across all patients; a patient can have multiple records)