Zubair Afzal, Martijn J Schuemie, Jan C van Blijderveen, Elif F Sen, Miriam C J M Sturkenboom, Jan A Kors.
Abstract
BACKGROUND: Distinguishing cases from non-cases in free-text electronic medical records is an important initial step in observational epidemiological studies, but manual record validation is time-consuming and cumbersome. We compared different approaches to develop an automatic case identification system with high sensitivity to assist manual annotators.
Year: 2013 PMID: 23452306 PMCID: PMC3602667 DOI: 10.1186/1472-6947-13-30
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Total number of subjects and corresponding entries in the hepatobiliary disease and acute renal failure data sets
| | Hepatobiliary disease | Acute renal failure |
|---|---|---|
| Positive cases | 656 | 237 |
| Seen entries | 656 | 237 |
| Unseen entries | 61,179 | 58,022 |
| Negative cases | 317 | 3,751 |
| Seen entries | 317 | 3,751 |
| Implicit entries | 27,276 | 319,204 |
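The class imbalance implied by these counts can be checked with a short calculation. The sketch below assumes that imbalance is measured as negative entries per positive entry and that set expansion adds the implicit entries to the negatives. Both assumptions are readings of the table, not statements from the source, and the function and variable names are illustrative.

```python
def neg_pos_ratio(positives, negatives, implicit=0):
    """Negative entries per positive entry; implicit entries count as negatives."""
    return (negatives + implicit) / positives

# Counts copied from the table above (column order: hepatobiliary, acute renal failure).
data_sets = {
    "hepatobiliary": {"positives": 656, "negatives": 317, "implicit": 27_276},
    "acute renal failure": {"positives": 237, "negatives": 3_751, "implicit": 319_204},
}

for name, d in data_sets.items():
    seen_only = neg_pos_ratio(d["positives"], d["negatives"])
    expanded = neg_pos_ratio(d["positives"], d["negatives"], d["implicit"])
    print(f"{name}: seen only {seen_only:.2f}, with implicit entries {expanded:.1f}")
```

Rounded, the results are about 0.5 and 42 for hepatobiliary disease and 16 and 1363 for acute renal failure, which agree with the third value in each row of the set-expansion table that follows.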
Sensitivity and specificity results of various classifiers trained on the hepatobiliary and the acute renal failure data sets, with and without set expansion
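| Data set | Set expansion | Negative/positive ratio | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. |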
|---|---|---|---|---|---|---|---|---|---|---|
| Hepatobiliary | No | 0.5 | 0.99 | 0.03 | 0.99 | 0.03 | 0.99 | 0.07 | 0.99 | 0.04 |
| | Yes | 42 | 0.89 | 0.77 | 0.90 | 0.79 | 0.92 | 0.69 | 0.91 | 0.71 |
| Acute renal failure | No | 16 | 0.62 | 0.92 | 0.69 | 0.88 | 0.69 | 0.90 | 0.71 | 0.89 |
| | Yes | 1363 | 0.39 | 0.98 | - | - | 0.45 | 0.99 | 0.41 | 0.98 |
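Sensitivity and specificity of various classifiers trained on the hepatobiliary data set for different percentages of under-sampling
| Under-sampling (%) | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. | Negative/positive ratio |
|---|---|---|---|---|---|---|---|---|---|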
| 0 | 0.89 | 0.77 | 0.92 | 0.68 | 0.91 | 0.71 | 0.90 | 0.79 | 42 |
| 10 | 0.89 | 0.76 | 0.93 | 0.65 | 0.91 | 0.75 | 0.90 | 0.80 | 38 |
| 20 | 0.89 | 0.75 | 0.93 | 0.63 | 0.91 | 0.73 | 0.91 | 0.79 | 34 |
| 30 | 0.89 | 0.76 | 0.94 | 0.61 | 0.93 | 0.72 | 0.90 | 0.78 | 30 |
| 40 | 0.89 | 0.73 | 0.93 | 0.60 | 0.92 | 0.69 | 0.91 | 0.77 | 25 |
| 50 | 0.90 | 0.70 | 0.93 | 0.58 | 0.92 | 0.71 | 0.91 | 0.76 | 21 |
| 60 | 0.90 | 0.71 | 0.94 | 0.56 | 0.92 | 0.72 | 0.92 | 0.73 | 17 |
| 70 | 0.91 | 0.67 | 0.95 | 0.55 | 0.91 | 0.72 | 0.92 | 0.70 | 13 |
| 80 | 0.92 | 0.64 | 0.94 | 0.49 | 0.92 | 0.73 | 0.92 | 0.68 | 9 |
| 90 | 0.94 | 0.52 | 0.91 | 0.60 | 0.93 | 0.67 | 0.93 | 0.59 | 5 |
| 100 | 0.99 | 0.12 | 0.99 | 0.07 | 0.99 | 0.03 | 0.99 | 0.14 | 0.5 |
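Sensitivity and specificity of various classifiers trained on the acute renal failure data set for different percentages of under-sampling
| Under-sampling (%) | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. | Negative/positive ratio |
|---|---|---|---|---|---|---|---|---|---|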
| 0 | 0.62 | 0.92 | 0.69 | 0.90 | 0.71 | 0.89 | 0.69 | 0.88 | 16 |
| 10 | 0.64 | 0.90 | 0.74 | 0.89 | 0.75 | 0.89 | 0.69 | 0.87 | 14 |
| 20 | 0.64 | 0.89 | 0.75 | 0.83 | 0.75 | 0.88 | 0.74 | 0.86 | 13 |
| 30 | 0.66 | 0.88 | 0.76 | 0.82 | 0.76 | 0.88 | 0.75 | 0.85 | 11 |
| 40 | 0.70 | 0.85 | 0.75 | 0.87 | 0.74 | 0.88 | 0.75 | 0.85 | 9 |
| 50 | 0.74 | 0.81 | 0.76 | 0.80 | 0.77 | 0.76 | 0.76 | 0.82 | 8 |
| 60 | 0.82 | 0.72 | 0.77 | 0.81 | 0.84 | 0.68 | 0.83 | 0.82 | 6 |
| 70 | 0.83 | 0.67 | 0.83 | 0.70 | 0.83 | 0.61 | 0.86 | 0.77 | 5 |
| 80 | 0.86 | 0.56 | 0.89 | 0.49 | 0.90 | 0.44 | 0.90 | 0.45 | 3 |
| 90 | 0.92 | 0.41 | 0.90 | 0.43 | 0.89 | 0.43 | 0.92 | 0.39 | 2 |
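The last value in each row of the two under-sampling tables is consistent with removing the stated percentage of negative entries and recomputing the negative-to-positive ratio. The sketch below is inferred from the numbers and is not the authors' code. It assumes that under-sampling on the hepatobiliary data set removes only implicit entries (the 317 seen negatives stay), and that on the acute renal failure data set it removes seen negatives with no set expansion. The function name is hypothetical.

```python
def ratio_after_undersampling(positives, kept_negatives, sampled_negatives, percent):
    """Negatives per positive after dropping `percent` % of `sampled_negatives`."""
    remaining = kept_negatives + sampled_negatives * (1 - percent / 100)
    return remaining / positives

# Hepatobiliary: 656 positives, 317 seen negatives kept, 27,276 implicit entries sampled.
hepato = [ratio_after_undersampling(656, 317, 27_276, p) for p in range(0, 101, 10)]

# Acute renal failure: 237 positives, 3,751 seen negatives sampled, none held back.
arf = [ratio_after_undersampling(237, 0, 3_751, p) for p in range(0, 91, 10)]

print([round(r, 1) for r in hepato])
print([round(r, 1) for r in arf])
```

Rounded to whole numbers, both lists agree with the final column of the corresponding table. The only exception is the 100% hepatobiliary row, where the computed ratio of about 0.48 is reported as 0.5.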
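Sensitivity and specificity of various classifiers trained on the hepatobiliary data set for different percentages of over-sampling
| Over-sampling (%) | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. | Negative/positive ratio |
|---|---|---|---|---|---|---|---|---|---|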
| 0 | 0.89 | 0.77 | 0.92 | 0.68 | 0.91 | 0.71 | 0.90 | 0.79 | 42 |
| 100 | 0.90 | 0.72 | 0.96 | 0.52 | 0.94 | 0.64 | 0.93 | 0.73 | 21 |
| 200 | 0.90 | 0.70 | 0.96 | 0.47 | 0.96 | 0.56 | 0.94 | 0.67 | 14 |
| 300 | 0.91 | 0.70 | 0.97 | 0.44 | 0.96 | 0.54 | 0.95 | 0.65 | 11 |
| 400 | 0.91 | 0.71 | 0.98 | 0.45 | 0.97 | 0.50 | 0.95 | 0.63 | 8 |
| 500 | 0.92 | 0.69 | 0.98 | 0.43 | 0.97 | 0.48 | 0.95 | 0.62 | 7 |
| 600 | 0.92 | 0.68 | 0.97 | 0.35 | 0.96 | 0.47 | 0.95 | 0.61 | 6 |
| 700 | 0.92 | 0.67 | 0.98 | 0.34 | 0.97 | 0.47 | 0.95 | 0.60 | 5 |
| 800 | 0.92 | 0.65 | 0.97 | 0.34 | 0.97 | 0.47 | 0.95 | 0.61 | 5 |
| 900 | 0.93 | 0.65 | 0.97 | 0.34 | 0.97 | 0.45 | 0.95 | 0.59 | 4 |
| 1000 | 0.93 | 0.64 | 0.97 | 0.35 | 0.96 | 0.44 | 0.95 | 0.59 | 4 |
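Sensitivity and specificity of various classifiers trained on the acute renal failure data set for different percentages of over-sampling
| Over-sampling (%) | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. | Negative/positive ratio |
|---|---|---|---|---|---|---|---|---|---|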
| 0 | 0.62 | 0.92 | 0.69 | 0.90 | 0.75 | 0.89 | 0.69 | 0.88 | 16 |
| 100 | 0.66 | 0.86 | 0.78 | 0.80 | 0.81 | 0.76 | 0.74 | 0.75 | 8 |
| 200 | 0.71 | 0.81 | 0.84 | 0.71 | 0.84 | 0.65 | 0.77 | 0.67 | 5 |
| 300 | 0.74 | 0.77 | 0.89 | 0.59 | 0.88 | 0.65 | 0.80 | 0.65 | 4 |
| 400 | 0.76 | 0.73 | 0.89 | 0.51 | 0.86 | 0.64 | 0.81 | 0.61 | 3 |
| 500 | 0.77 | 0.69 | 0.89 | 0.48 | 0.84 | 0.64 | 0.82 | 0.60 | 3 |
| 600 | 0.78 | 0.66 | 0.91 | 0.48 | 0.89 | 0.59 | 0.82 | 0.60 | 2 |
| 700 | 0.82 | 0.60 | 0.92 | 0.43 | 0.89 | 0.54 | 0.82 | 0.60 | 2 |
| 800 | 0.82 | 0.57 | 0.94 | 0.37 | 0.86 | 0.60 | 0.82 | 0.61 | 2 |
| 900 | 0.83 | 0.55 | 0.93 | 0.36 | 0.89 | 0.53 | 0.83 | 0.61 | 2 |
| 1000 | 0.84 | 0.54 | 0.95 | 0.36 | 0.88 | 0.54 | 0.83 | 0.61 | 1 |
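The final column of the two over-sampling tables can be reproduced in the same spirit. The sketch below assumes that over-sampling by p% adds p% extra copies of the positive entries, so the positive count becomes positives × (1 + p/100). It also assumes that the hepatobiliary negatives include the implicit entries, while the acute renal failure negatives do not. These are inferences from the reported ratios, and the function name is hypothetical.

```python
def ratio_after_oversampling(positives, negatives, percent):
    """Negatives per positive after adding `percent` % extra positive entries."""
    return negatives / (positives * (1 + percent / 100))

percentages = range(0, 1001, 100)

# Hepatobiliary: 656 positives; 317 seen + 27,276 implicit negatives.
hepato = [ratio_after_oversampling(656, 317 + 27_276, p) for p in percentages]

# Acute renal failure: 237 positives; 3,751 seen negatives.
arf = [ratio_after_oversampling(237, 3_751, p) for p in percentages]

print([round(r) for r in hepato])
print([round(r) for r in arf])
```

Under these assumptions the rounded values agree with the last column of both tables row by row.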
Sensitivity and specificity of various classifiers trained on the hepatobiliary data set for different cost values of cost-sensitive learning
| Cost | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.86 | 0.78 | 0.90 | 0.68 | 0.93 | 0.67 | 0.89 | 0.71 |
| 10 | 0.87 | 0.78 | 0.95 | 0.54 | 0.93 | 0.68 | 0.92 | 0.69 |
| 25 | 0.87 | 0.79 | 0.96 | 0.47 | 0.93 | 0.67 | 0.92 | 0.69 |
| 50 | 0.87 | 0.79 | 0.96 | 0.47 | 0.93 | 0.67 | 0.91 | 0.66 |
| 100 | 0.87 | 0.79 | 0.96 | 0.47 | 0.93 | 0.67 | 0.92 | 0.66 |
| 200 | 0.87 | 0.79 | 0.96 | 0.47 | 0.93 | 0.67 | 0.92 | 0.66 |
| 400 | 0.87 | 0.79 | 1.00 | 0.09 | 0.97 | 0.24 | 0.99 | 0.12 |
| 800 | 0.87 | 0.79 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 |
| 1000 | 0.87 | 0.79 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 |
Sensitivity and specificity of various classifiers trained on the acute renal failure data set for different cost values of cost-sensitive learning
| Cost | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.59 | 0.92 | 0.74 | 0.85 | 0.78 | 0.80 | 0.67 | 0.73 |
| 10 | 0.59 | 0.92 | 0.81 | 0.63 | 0.78 | 0.80 | 0.73 | 0.69 |
| 25 | 0.59 | 0.92 | 0.81 | 0.63 | 0.78 | 0.80 | 0.76 | 0.64 |
| 50 | 0.59 | 0.92 | 0.89 | 0.35 | 0.78 | 0.80 | 0.78 | 0.60 |
| 100 | 0.59 | 0.92 | 1.00 | 0.00 | 0.78 | 0.80 | 0.97 | 0.11 |
| 200 | 0.59 | 0.92 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 |
| 400 | 0.59 | 0.92 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 |
| 800 | 0.59 | 0.92 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 |
| 1000 | 0.59 | 0.92 | 1.00 | 0.00 | 1.00 | 0.00 | 1.00 | 0.00 |
Performance of the classifiers with the highest sensitivity and a specificity of at least 0.5 on the hepatobiliary disease and acute renal failure data sets
| Data set | Classifier | Sens. | Spec. | Sens. | Spec. | Sens. | Spec. |
|---|---|---|---|---|---|---|---|
| Hepatobiliary disease | SVM | 0.89 | 0.77 | 0.93 | 0.65 | 0.87 | 0.79 |
| | MyC | 0.92 | 0.68 | 0.94 | 0.54 | 0.95 | 0.54 |
| | C4.5 | 0.90 | 0.79 | 0.93 | 0.59 | 0.92 | 0.66 |
| | RIPPER | 0.90 | 0.71 | 0.93 | 0.72 | 0.93 | 0.67 |
| Acute renal failure | SVM | 0.62 | 0.92 | 0.84 | 0.54 | 0.59 | 0.92 |
| | MyC | 0.69 | 0.90 | 0.83 | 0.70 | 0.81 | 0.63 |
| | C4.5 | 0.69 | 0.88 | 0.83 | 0.61 | 0.78 | 0.60 |
| | RIPPER | 0.71 | 0.89 | 0.84 | 0.68 | 0.78 | 0.80 |
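The selection rule in this table's caption (highest sensitivity subject to a specificity of at least 0.5) can be written as a small filter-then-maximize step. The sketch below is illustrative only: the function name and the tie-break on specificity are assumptions. The input is the second sensitivity/specificity pair of each row of the hepatobiliary cost-sensitive table.

```python
def pick_operating_point(results, min_specificity=0.5):
    """Return the (setting, sensitivity, specificity) tuple with the highest
    sensitivity among those meeting the specificity floor.
    Ties on sensitivity go to the higher specificity (an assumed tie-break)."""
    eligible = [r for r in results if r[2] >= min_specificity]
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r[1], r[2]))

# (cost, sensitivity, specificity), copied from the hepatobiliary cost-sensitive table.
results = [
    (1, 0.90, 0.68), (10, 0.95, 0.54), (25, 0.96, 0.47),
    (50, 0.96, 0.47), (100, 0.96, 0.47), (200, 0.96, 0.47),
    (400, 1.00, 0.09), (800, 1.00, 0.00), (1000, 1.00, 0.00),
]

print(pick_operating_point(results))  # (10, 0.95, 0.54)
```

Settings with a cost of 25 or more reach higher sensitivity but fall below the specificity floor, so the rule settles on 0.95 / 0.54. That pair coincides with the last pair reported in the MyC row for hepatobiliary disease above.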
Error analysis of the false negatives by the MyC classifier trained on the hepatobiliary disease data set with 70% under-sampling
| Evidence not in the model | 13 (38) |
| Evidence removed by negation/speculation filter | 12 (35) |
| Spelling variations | 5 (15) |
| Labeling error | 4 (12) |
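The figures in parentheses appear to be percentages of all false negatives. A quick check, assuming the four categories are exhaustive and the shares are rounded to whole percentages:

```python
# Counts copied from the error-analysis table above.
false_negatives = {
    "Evidence not in the model": 13,
    "Evidence removed by negation/speculation filter": 12,
    "Spelling variations": 5,
    "Labeling error": 4,
}

total = sum(false_negatives.values())
shares = {cause: round(100 * n / total) for cause, n in false_negatives.items()}

print(total)   # 34
print(shares)
```

The counts sum to 34 false negatives, and the rounded shares (38, 35, 15, 12) match the parenthesized values.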