Anoop D Shah, Carlos Martinez, Harry Hemingway.
Abstract
BACKGROUND: Electronic health records are invaluable for medical research, but much information is stored as free text rather than in a coded form. For example, in the UK General Practice Research Database (GPRD), causes of death and test results are sometimes recorded only in free text. Free text can be difficult to use for research if it requires time-consuming manual review. Our aim was to develop an automated method for extracting coded information from free text in electronic patient records.
Year: 2012 PMID: 22870911 PMCID: PMC3483188 DOI: 10.1186/1472-6947-12-88
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Figure 1. Flowchart showing how the freetext matching algorithm analyses a text. Flowchart showing the stages in analysis of a text, with an example of analysis.
Figure 2. Dialog box for reviewing results of freetext matching algorithm. Example of a dialog box in the Microsoft Access 2000 database, showing the interface for analysing a text and reviewing the results.
Examples of free text associated with Read terms analysed by the freetext matching algorithm
| Analysis mode | Read code | Free text | Algorithm output |
| Standard | 55...11 | coronary severe triple vessel disease + impaired function | Read G340.11 |
| Standard | G20..00 | 176/100 but only taking doxaz 4mg od not bd as pw intended - increase to 4mg bd, pn recheck 4w please. otherwise isq but recent diagnosis ca prostate - seeing urol 1w - reassured re treatablility of this nowadays | Systolic BP: 176, Diastolic BP: 100, Follow up: 4 weeks, Read B46..00 |
| Standard | 32...00 | sinus bradycardia (58 bpm) nonspecific intraventricular conduction delay | Read R059.00 |
| Standard | 55...12 | ct did not show renal artery stenosis. | Negative: Read P769000 |
| Standard, append | G581.00 | acute give lasix and o2 and admitted | Read G581000 |
| Standard, append | 14A4.00 | had an mi in 1996 | Past medical history: Read G30..00 |
| Death | 22J..12 | found dead by family this afternoon, called ambulance, cd 1415hrs, sudden death, rpt to coroner | Read 213100 |
| Labtest | 42J..00 | original result: ’neutrophil count’ = 4.17 x109 / l (1. 5 - 7. 5) | Lab result: 4.17 |
| Sicknote | 9D2..00 | ### 09/01/2007 1 month heart problems, ### | Sickness certificate date: 9-Jan-2007, duration: 1 month |
| Standard | 5853.11 | significant mitral regurg | (no output; ‘regurg’ not recognised as an abbreviation for ‘regurgitation’) |
| Standard | 33B9500 | pt exercised according to the bruce protocol for 4 mins 39secs. he developed chest discomfort during the test with very significant, >3mm, st depression in leads v4, 5 and 6 and leads 2, 3 avf. ### | Read R065600 |
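The blood pressure and follow-up values in the second example row can be illustrated with a small pattern-matching sketch. This is not the published algorithm: the regular expressions, the plausibility limits, and the function names `extract_bp` and `extract_follow_up_weeks` are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for illustration; not the rules used by the published algorithm.
BP_PATTERN = re.compile(r"\b(\d{2,3})\s*/\s*(\d{2,3})\b")
FOLLOW_UP_PATTERN = re.compile(r"\b(?:recheck|review)\s+(\d+)\s*(?:w|weeks?)\b")


def extract_bp(text):
    """Return (systolic, diastolic) for the first plausible 'nnn/nn' reading, else None."""
    for match in BP_PATTERN.finditer(text):
        systolic, diastolic = int(match.group(1)), int(match.group(2))
        # Plausibility limits are assumptions; they reject date fragments such as 09/01.
        if 50 <= systolic <= 300 and 20 <= diastolic < systolic:
            return systolic, diastolic
    return None


def extract_follow_up_weeks(text):
    """Return the follow-up interval in weeks after 'recheck' or 'review', else None."""
    match = FOLLOW_UP_PATTERN.search(text)
    return int(match.group(1)) if match else None


note = ("176/100 but only taking doxaz 4mg od not bd as pw intended - "
        "increase to 4mg bd, pn recheck 4w please.")
print(extract_bp(note))               # (176, 100)
print(extract_follow_up_weeks(note))  # 4
```

A range check of this kind is one simple way to avoid reading the day and month of a date as a blood pressure; a production system would need many more abbreviations and context rules than this sketch covers.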
Analysis modes
| Mode | Trigger | Effect on analysis |
| Death | Read term for death | Extract date of death and death certificate categories. Test results are not extracted. |
| Pregnancy | Read term for pregnancy or text stating ‘pregnant’ | Duration in weeks is interpreted as gestational age |
| Labtest | Read term for test type | A numerical value or ‘normal’, ‘abnormal’ (depending on the test type) can be interpreted as the test result |
| Normal | Read term for certain investigations (e.g. chest radiograph) | The words ‘normal’ or ‘abnormal’ can be interpreted as the result |
| Date | Read term stating date | The text is expected to contain a single date |
| Sicknote | Read term stating ‘MED3’ (sickness certificate) or similar | Dates are regarded as sick certificate start and end dates |
| Standard | Read term not in one of the above categories | Standard analysis |
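As a rough illustration of mode-specific interpretation, the sketch below handles the Sicknote case, where a date and a duration are read as sickness certificate details. The patterns, the day-first (UK) date order, and the function name `parse_sicknote` are assumptions for illustration and are not taken from the published algorithm.

```python
import re
from datetime import datetime

# Hypothetical patterns; the day/month/year order is assumed because the data are from the UK.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
DURATION_PATTERN = re.compile(r"\b(\d+)\s+(day|week|month)s?\b")


def parse_sicknote(text):
    """Extract a certificate date and duration from sick note free text."""
    result = {}
    date_match = DATE_PATTERN.search(text)
    if date_match:
        day, month, year = (int(part) for part in date_match.groups())
        try:
            result["date"] = datetime(year, month, day).date()
        except ValueError:
            pass  # Not a valid calendar date, so leave it out.
    duration_match = DURATION_PATTERN.search(text)
    if duration_match:
        result["duration"] = (int(duration_match.group(1)), duration_match.group(2))
    return result


parsed = parse_sicknote("### 09/01/2007 1 month heart problems, ###")
print(parsed["date"].strftime("%d-%b-%Y"))  # 09-Jan-2007
print(parsed["duration"])                   # (1, 'month')
```

In Standard mode the same "1 month" might instead describe symptom duration, which is why the interpretation is tied to the Read term that triggers the mode.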
Figure 3. Source of cause of death information for deaths in 2001. Flow diagram showing determination of the underlying cause of death for a random sample of 3310 patients who died in 2001.
Performance of freetext matching algorithm and MetaMap on test sets
| Number of texts | 1000 | 1000 | 1000 | 1000 |
| Number of words | 7534 | 25981 | 25981 | 25981 |
| True positives | 683 | 346 | 286 | 273 |
| False positives | 11 | 32 | 126 | 18 |
| False negatives | 52 | 101 | 161 | 174 |
| Precision, % | 98.4 (97.2, 99.2) | 91.5 (88.3, 94.1) | 69.4 (64.7, 73.8) | 93.8 (90.4, 96.3) |
| Recall, % | 92.9 (90.8, 94.7) | 77.4 (73.2, 81.2) | 64.0 (59.3, 68.4) | 61.1 (56.4, 65.6) |
| F-score | 0.96 | 0.84 | 0.67 | 0.74 |
| Number strictly correct | 625 | 315 | 260 | 247 |
| Precision strict, % | 90.1 (87.6, 92.2) | 83.3 (79.2, 86.9) | 63.1 (58.2, 67.8) | 84.9 (80.2, 88.8) |
| True positives | 84 | 304 | 295 | 453 |
| False positives | 2 | 22 | 55 | 41 |
| Precision, % | 97.7 (91.9, 99.7) | 93.3 (90.0, 95.7) | 84.3 (80.0, 87.9) | 91.7 (88.9, 94.0) |
| True positives | 767 | 650 | 581 | 726 |
| False positives | 13 | 54 | 181 | 59 |
| Precision, % | 98.3 (97.2, 99.1) | 92.3 (90.1, 94.2) | 76.2 (73.1, 79.2) | 92.5 (90.4, 94.2) |
| True positives | 5 | 57 | 0 | 92 |
| False positives | 5 | 18 | 0 | 33 |
| Precision, % | 50.0 (18.7, 81.3) | 76.0 (64.7, 85.1) | | 73.6 (65.0, 81.1) |
| Percentage of texts | 0 | 1.2 | 0.5 | 0.6 |
| True positives | 116 | 96 | | |
| False positives | 15 | 10 | | |
| False negatives | 25 | 22 | | |
| Precision, % | 88.5 (81.8, 93.4) | 90.6 (83.3, 95.4) | | |
| Recall, % | 82.3 (74.9, 88.2) | 81.4 (73.1, 87.9) | | |
| F-score | 0.85 | 0.86 | | |
| True positives | | 105 | | |
| False positives | | 11 | | |
| False negatives | | 18 | | |
| Precision, % | | 90.5 (83.7, 95.2) | | |
| Recall, % | | 85.4 (77.9, 91.1) | | |
| F-score | | 0.89 | | |
Comparison of precision (positive predictive value) and recall (sensitivity) of the Freetext Matching Algorithm (FMA) and MetaMap against the gold standard of manual review, for two test sets: ‘General’, a random sample of 500 texts from cases and 500 from controls in a study on coronary artery disease; and ‘Death’, a random sample of 1000 texts associated with Read terms for death or suicide in 2001.
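The precision, recall, and F-score rows can be reproduced from the true positive, false positive, and false negative counts. The sketch below assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall), which the text does not state explicitly; the function name `precision_recall_f` is illustrative. It uses the counts from the first value column of the first block of rows (683, 11, 52).

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) as fractions between 0 and 1."""
    precision = true_pos / (true_pos + false_pos)  # positive predictive value
    recall = true_pos / (true_pos + false_neg)     # sensitivity
    # Assumption: F-score is the harmonic mean of precision and recall (F1).
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


precision, recall, f_score = precision_recall_f(683, 11, 52)
print(f"Precision: {precision * 100:.1f}%")  # Precision: 98.4%
print(f"Recall: {recall * 100:.1f}%")        # Recall: 92.9%
print(f"F-score: {f_score:.2f}")             # F-score: 0.96
```

These values match the reported 98.4, 92.9, and 0.96. The confidence intervals in the table are not reproduced here, since the interval method is not given in this text.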
Most common terms extracted by the freetext matching algorithm from texts associated with death in 2001
| OXMIS | BRONCHOPNEUMONIA | 9.9 |
| Read | Acute myocardial infarction | 4.8 |
| Read | [ | 4.0 |
| Read | Ischaemic heart disease | 3.4 |
| Read | CVA unspecified | 3.1 |
| OXMIS | PNEUMONIA | 2.5 |
| OXMIS | UNKNOWN CAUSE | 2.5 |
| Read | Congestive cardiac failure | 1.7 |
| Read | Septicaemia | 1.7 |
| OXMIS | CORONARY ARTERY ATHEROMA | 1.5 |
| Read | Lung cancer | 1.5 |
| Read | [ | 1.5 |
| OXMIS | CARCINOMA | 1.3 |
| Read | Chronic obstructive pulmonary disease | 1.3 |
| OXMIS | HYPERTENSION | 1.3 |
| Read | Cardiac arrest | 1.1 |
| Read | Deep vein thrombosis | 1.1 |
| Read | Left ventricular failure | 1.1 |
| Read | Sepsis | 1.1 |
| Read | Cerebrovascular disease | 1.0 |
Recording of cause of death in the GPRD in a random sample of patients who died in 2001
| Source of cause of death information | Number of patients | % |
| MCCD information in structured data area | 88 | 2.7 |
| Read term on or after date of death | 1179 | 35.6 |
| Read term before date of death | 207 | 6.3 |
| MCCD information in free text | 82 | 2.5 |
| Free text stating cause of death | 43 | 1.3 |
| Other free text | 517 | 15.6 |
| No cause of death information | 1194 | 36.1 |
| Total | 3310 | 100 |