Scott Russell Halgrim, Fei Xia, Imre Solti, Eithon Cadag, Ozlem Uzuner.
Abstract
BACKGROUND: Extracting medication information from clinical records has many potential applications, and recently published research, systems, and competitions reflect an interest therein. Much of the early extraction work involved rules and lexicons, but more recently machine learning has been applied to the task.
Year: 2011 PMID: 21992591 PMCID: PMC3194174 DOI: 10.1186/2041-1480-2-S3-S2
Source DB: PubMed Journal: J Biomed Semantics
A sample discharge summary excerpt and the corresponding entries in the gold standard
The fields inside an entry are separated by “||”. Each field is represented by the string and its position (i.e., “line number: token number” offsets). “nm” means the field value is not present for this medication event. The fields are medication name (m), dosage (do), mode (mo), frequency (f), duration (d), and reason (r).
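The entry layout described above can be read with a small parser. This is a minimal sketch, not the authors' code: the exact serialization is not reproduced here, so the `label="text" line:token line:token` shape of each field, the `Field` type, the `parse_entry` helper, and the sample entry are all assumptions made for illustration.

```python
import re
from typing import Dict, NamedTuple, Optional, Tuple

Offset = Tuple[int, int]  # (line number, token number)


class Field(NamedTuple):
    label: str               # one of m, do, mo, f, d, r
    text: Optional[str]      # None when the gold standard says "nm"
    start: Optional[Offset]  # position of the first token
    end: Optional[Offset]    # position of the last token


# Assumed shape of one field: label="text" line:token line:token
FIELD_RE = re.compile(r'^(\w+)="(.*?)"(?:\s+(\d+):(\d+)\s+(\d+):(\d+))?$')


def parse_entry(entry: str) -> Dict[str, Field]:
    """Split one gold-standard entry on "||" and parse each field."""
    fields: Dict[str, Field] = {}
    for raw in entry.split("||"):
        match = FIELD_RE.match(raw.strip())
        if match is None:
            raise ValueError(f"unrecognized field: {raw!r}")
        label, text, l1, t1, l2, t2 = match.groups()
        if text == "nm":
            # "nm": the field value is not present for this medication event.
            fields[label] = Field(label, None, None, None)
        elif l1 is None:
            raise ValueError(f"missing offsets in field: {raw!r}")
        else:
            fields[label] = Field(
                label, text, (int(l1), int(t1)), (int(l2), int(t2))
            )
    return fields


# Hypothetical entry, invented for illustration only.
sample = 'm="lasix" 23:4 23:4||do="40 mg" 23:5 23:6||mo="nm"||f="nm"||d="nm"||r="nm"'
parsed = parse_entry(sample)
print(parsed["do"])
```

Mapping "nm" to `None` keeps absent fields distinguishable from a field whose text happens to be empty.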
The data sets used in our experiments
| Data Sets | # of Summaries | # of Entries | # of Fields | # of Name | # of Dose | # of Freq | # of Mode | # of Duration | # of Reason |
|---|---|---|---|---|---|---|---|---|---|
| Training set | 110 | 5970 (54.3) | 14886 (135.3) | 5684 (51.7) | 2929 (26.6) | 2740 (24.9) | 2146 (19.5) | 302 (2.7) | 1085 (9.9) |
| Dev set | 35 | 2401 (68.6) | 5988 (171.1) | 2302 (65.8) | 1163 (33.2) | 1096 (31.3) | 880 (25.1) | 111 (3.2) | 436 (12.5) |
| Test set | 251 | 8936 (35.6) | 22041 (87.8) | 8495 (33.8) | 4387 (17.5) | 3999 (15.9) | 3307 (13.2) | 511 (2.0) | 1342 (5.3) |
The numbers in parentheses are the average numbers of entries or fields per discharge summary.
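The parenthesized averages can be rechecked from the totals. A minimal sketch using the entry and field counts from the table above; the `per_summary` helper and the dictionary layout are illustrative names, not part of the original work.

```python
# Totals copied from the data set table: (summaries, entries, fields).
data_sets = {
    "Training set": (110, 5970, 14886),
    "Dev set": (35, 2401, 5988),
    "Test set": (251, 8936, 22041),
}


def per_summary(total: int, n_summaries: int) -> float:
    """Average count per discharge summary, rounded to one decimal."""
    return round(total / n_summaries, 1)


averages = {
    name: (per_summary(entries, n), per_summary(fields, n))
    for name, (n, entries, fields) in data_sets.items()
}

for name, (avg_entries, avg_fields) in averages.items():
    print(f"{name}: {avg_entries} entries, {avg_fields} fields per summary")
```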
The performance of field detection on the development set
| Field | Precision | Recall | F-score |
|---|---|---|---|
| Name | 91.2 | 88.5 | 89.9 |
| Dosage | 96.6 | 90.8 | 93.6 |
| Frequency | 93.9 | 89.0 | 91.8 |
| Mode | 95.7 | 90.3 | 92.9 |
| Duration | 73.8 | 43.2 | 54.5 |
| Reason | 72.2 | 31.0 | 43.3 |
| All fields | 92.6 | 84.5 | 88.4 |
Vertical precision, recall, and F-score.
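For readers who want to relate the three columns, the sketch below assumes the F-score is the balanced harmonic mean of precision and recall (F1); the text here does not state the weighting, so treat that as an assumption. The `f_score` helper is an illustrative name. Values recomputed from the rounded precision and recall can differ from the published F-score in the last digit.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean); inputs and output in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for the Dosage and Duration rows of the table above.
dosage_f = f_score(96.6, 90.8)
duration_f = f_score(73.8, 43.2)
print(round(dosage_f, 1), round(duration_f, 1))
```

Because the harmonic mean is pulled toward the smaller value, the low recall on Duration and Reason dominates their F-scores even though precision stays above 70.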
The performance of the field linking step on the development set
| Input | Precision | Recall | F-score |
|---|---|---|---|
| Gold standard | 87.4 | 75.1 | 80.8 |
| System | 96.2 | 94.5 | 95.3 |
Gold standard input: assuming perfect input from the field detection step; System input: using the actual output of the system’s field detection step.
System performance on the development set with different feature sets
| Features | Precision | Recall | F-score |
|---|---|---|---|
| F1 | 72.5 | 60.3 | 65.8 |
| F1-F2 | 82.5 | 78.2 | 80.3 |
| F1-F3 | 88.4 | 77.9 | 82.8 |
| F1-F4a | 87.4 | 77.9 | 82.4 |
| F1-F4b | 88.1 | 79.4 | 83.5 |
Horizontal precision, recall, and F-score show that including all proposed feature sets and the external data lists yields the best recall and F-score.
System performance on the test set
| Metric | Field | Precision | Recall | F-score |
|---|---|---|---|---|
| Horizontal | N/A | 88.6 | 80.2 | 84.1 |
| Vertical | Name | 92.6 | 87.1 | 89.8 |
| Vertical | Dosage | 96.3 | 90.2 | 93.1 |
| Vertical | Frequency | 95.6 | 90.8 | 93.2 |
| Vertical | Mode | 96.7 | 90.2 | 93.3 |
| Vertical | Duration | 70.6 | 40.5 | 51.5 |
| Vertical | Reason | 73.4 | 34.7 | 47.1 |
| Vertical | All fields | 91.6 | 82.7 | 86.9 |
Horizontal and vertical metrics for system trained on the union of the training and development sets using all features in sets F1-F4b.
Figure 1. System performance on the development set with different training set sizes. Key: + represents horizontal F-scores with features in F1-F4b; ○ represents horizontal F-scores with features in F1-F4a.
Figure 2. Cascade vs. find_all. Key: + represents horizontal F-scores with the three-module cascade; ○ represents horizontal F-scores with find_all.
Benchmark performances of the top five i2b2 systems on the test set
| Rank | Team | Precision | Recall | F-score |
|---|---|---|---|---|
| 1 | USyd | 89.6 | 82.0 | 85.7 |
| 2 | Vanderbilt | 84.0 | 80.3 | 82.1 |
| 3 | Manchester | 86.4 | 76.6 | 81.2 |
| 4 | NLM | 78.4 | 82.3 | 80.3 |
| 5 | BME-Humboldt | 84.1 | 75.8 | 79.7 |
Horizontal precision, recall, and F-score.