Bevan Koopman, Sarvnaz Karimi, Anthony Nguyen, Rhydwyn McGuire, David Muscatello, Madonna Kemp, Donna Truran, Ming Zhang, Sarah Thackway.
Abstract
BACKGROUND: Death certificates provide an invaluable source for mortality statistics which can be used for surveillance and early warnings of increases in disease activity and to support the development and monitoring of prevention or response strategies. However, their value can be realised only if accurate, quantitative data can be extracted from death certificates, an aim hampered by both the volume and variable nature of certificates written in natural language. This study aims to develop a set of machine learning and rule-based methods to automatically classify death certificates according to four high impact diseases of interest: diabetes, influenza, pneumonia and HIV.
Year: 2015 PMID: 26174442 PMCID: PMC4502908 DOI: 10.1186/s12911-015-0174-2
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Types of features — both term and concept-based — extracted from death certificates
| Feature type | Feature | Description | Example certificate extract | Resulting feature values |
|---|---|---|---|---|
| Term | TokenStem | A token stem, i.e., the stemmed version of a word. | Acute chronic renal failure | Acut, chronic, renal, failur. |
| | TokenStem | The | chronic renal failure | Chronic renal, renal failur. |
| Concept | SCTConceptId | SNOMED CT concept identifier (as extracted by the Medtex system) | chronic renal failure | 90688005. |
(Stemming is a process of removing and replacing word suffixes to arrive at a common root form of the word.)
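To make the Term rows concrete, the following is a minimal Python sketch of how token stems and pairs of adjacent stems could be derived from a certificate extract. `crude_stem` is a hypothetical toy stemmer that only lowercases and strips a trailing "e"; it is an assumption for illustration, not the stemmer used in the study. The pairing of adjacent stems is inferred from the feature values shown in the second row of the table.

```python
from typing import List


def crude_stem(token: str) -> str:
    """Toy stand-in for a real stemmer (assumption): lowercase, drop a trailing 'e'."""
    token = token.lower()
    if token.endswith("e") and len(token) > 3:
        return token[:-1]
    return token


def token_stems(text: str) -> List[str]:
    """Split on whitespace and stem every token."""
    return [crude_stem(tok) for tok in text.split()]


def stem_bigrams(text: str) -> List[str]:
    """Join each pair of adjacent stems into one feature value."""
    stems = token_stems(text)
    return [f"{a} {b}" for a, b in zip(stems, stems[1:])]


if __name__ == "__main__":
    print(token_stems("Acute chronic renal failure"))
    print(stem_bigrams("chronic renal failure"))
```

A production system would more likely use an established stemmer, such as a Porter stemmer, in place of `crude_stem`.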
List of keywords used to identify cause of death as pneumonia, influenza, diabetes and HIV
| Disease | Included keywords | Excluded keywords |
|---|---|---|
| Pneumonia | Pneumonia, Pnuemonia, Pnemonia, Pneomonia, Pneamonia, Penumonia, Pheumonia | Aapiration, Aspirare, Aspiranion |
| Influenza | Influenza, Influenza, H1N1, Swine Flu, Swineflu, Swine Influ, SwineInflu | Haemophilus Influenzae, Haemophilus |
| Diabetes | Diabetes, NIDDM, IDDM, Diabetes type 1, Diabetes Type I, Diabetes Type 2, Diabetes Type II, Type I diabetes, Type II diabetes, Type 1 diabetes, Type 2 diabetes, Type 2 diabetic mellitus, Type II diabetic mellitus, Diabetes mellitus Type 2, Diabetes mellitus Type II, Diabetic | |
| HIV | HIV, AIDS, human immunodeficiency | |
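The keyword table can be read as a set of include/exclude rules. The sketch below shows one possible way to apply such rules in Python; it assumes case-insensitive substring matching, which the source does not specify. The `RULES` dictionary holds only a subset of the listed keywords, and the names `RULES` and `classify` are hypothetical.

```python
from typing import Dict, List, Set

# Subset of the keyword table; matching strategy is an assumption.
RULES: Dict[str, Dict[str, List[str]]] = {
    "Pneumonia": {
        "include": ["pneumonia", "pnuemonia", "pnemonia"],
        "exclude": ["aapiration", "aspirare", "aspiranion"],
    },
    "Influenza": {
        "include": ["influenza", "h1n1", "swine flu"],
        "exclude": ["haemophilus influenzae", "haemophilus"],
    },
    "Diabetes": {"include": ["diabetes", "niddm", "iddm", "diabetic"], "exclude": []},
    "HIV": {"include": ["hiv", "aids", "human immunodeficiency"], "exclude": []},
}


def classify(text: str) -> Set[str]:
    """Return every disease whose include keywords match and exclude keywords do not."""
    lowered = text.lower()
    matched = set()
    for disease, rule in RULES.items():
        has_include = any(keyword in lowered for keyword in rule["include"])
        has_exclude = any(keyword in lowered for keyword in rule["exclude"])
        if has_include and not has_exclude:
            matched.add(disease)
    return matched


if __name__ == "__main__":
    print(classify("Pnuemonia, type 2 diabetes"))
    print(classify("Haemophilus influenzae meningitis"))
```

Plain substring matching is deliberately simple and can over-match (for example, "aids" inside an unrelated phrase); word-boundary matching would be a natural refinement.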
Breakdown of the dataset according to disease of interest and ICD-10 code and based on underlying and alternative cause of death numbers
| Disease/ICD-10 | #Underlying COD | #Alternative COD | #Total |
|---|---|---|---|
| Diabetes | 7144 | 22647 | 29791 |
| E10 | 830 | 1933 | 2763 |
| E11 | 2449 | 10307 | 12756 |
| E13 | 2 | 19 | 21 |
| E14 | 3862 | 10387 | 14249 |
| O24 × | 1 | 1 | 2 |
| Influenza | 148 | 44 | 192 |
| J09 × | 0 | 0 | 0 |
| J10 × | 10 | 3 | 13 |
| J11 | 138 | 41 | 179 |
| Pneumonia | 7259 | 36688 | 43947 |
| J12 | 33 | 38 | 71 |
| J13 | 59 | 39 | 98 |
| J14 × | 5 | 11 | 16 |
| J15 | 241 | 405 | 646 |
| J16 × | 3 | 6 | 9 |
| J17 × | 0 | 0 | 0 |
| J18 | 6918 | 36189 | 43107 |
| HIV | 371 | 406 | 777 |
| B20 | 139 | 17 | 156 |
| B21 | 59 | 6 | 65 |
| B22 | 80 | 9 | 89 |
| B23 | 54 | 21 | 75 |
| B24 | 39 | 398 | 437 |
Counts for each disease of interest are the sum of the counts for its individual ICD-10 codes. Individual classifiers were not built for ICD-10 classes marked with '×' because these classes had too few cases.
Classification performance results for diseases of interest: Influenza, Diabetes, Pneumonia and HIV
(a) Rule-based

| Disease | Precision | Recall | F-measure | True neg. | False pos. | False neg. | True pos. |
|---|---|---|---|---|---|---|---|
| Influenza | 0.94 | 0.89 | 0.92 | 68430 | 2 | 4 | 34 |
| Pneumonia | 0.98 | 0.97 | 0.97 | 59351 | 215 | 274 | 8630 |
| Diabetes | 0.98 | 0.96 | 0.97 | 62519 | 100 | 212 | 5639 |
| HIV | 0.93 | 0.85 | 0.89 | 68373 | 6 | 14 | 77 |
| Macro-average | 0.94 | 0.96 | 0.95 | | | | |
| Micro-average | 0.98 | 0.98 | 0.98 | | | | |

(b) Machine learning

| Disease | Precision | Recall | F-measure | True neg. | False pos. | False neg. | True pos. |
|---|---|---|---|---|---|---|---|
| Influenza | 0.84 | 0.95 | 0.89 | 68425 | 7 | 2 | 36 |
| Pneumonia | 0.98 | 0.97 | 0.97 | 59364 | 202 | 279 | 8625 |
| Diabetes | 0.98 | 0.99* | 0.99* | 62522 | 97 | 72 | 5779 |
| HIV | 0.91 | 0.96 | 0.93 | 68370 | 9 | 4 | 87 |
| Macro-average | 0.93 | 0.97 | 0.94 | | | | |
| Micro-average | 0.98 | 0.98 | 0.98 | | | | |

Macro-average is the mean of the precision, recall and F-measure values of the four classes above.
Micro-average aggregates the confusion matrix values of all classes and calculates the measures over the pooled data.
Statistically significant differences between the rule-based and machine learning methods, as measured with a two-tailed z-test, are marked with * (p < 0.05).
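The source does not state which variant of the z-test was used or exactly which quantities were compared. As an illustration only, the sketch below applies a pooled two-proportion z-test to recall, using the confusion matrix counts reported for the two methods. Treating the two classifiers as independent samples is an assumption; because both were evaluated on the same certificates, a paired test may be more appropriate. The function name `two_proportion_z_test` is hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    successes_a: int, n_a: int, successes_b: int, n_b: int
) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-tailed p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Recall = true positives / all ground-truth positives.
    # Diabetes: rules 5639 of 5851, machine learning 5779 of 5851.
    print(two_proportion_z_test(5639, 5851, 5779, 5851))
    # Influenza: rules 34 of 38, machine learning 36 of 38.
    print(two_proportion_z_test(34, 38, 36, 38))
```

Under these assumptions the diabetes recall difference falls below the 0.05 threshold and the influenza difference does not, which is consistent with the placement of the * markers.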
Fig. 1 Classification performance results for diseases of interest: Influenza, Diabetes, Pneumonia and HIV. Error bars show 95% confidence intervals
Classification performance results for individual ICD-10 classes
| Disease | ICD-10 class | Precision | Recall | F-measure | True neg. | False pos. | False neg. | True pos. |
|---|---|---|---|---|---|---|---|---|
| Diabetes | E10 | 0.76 | 0.97 | 0.86 | 67774 | 162 | 14 | 520 |
| | E11 | 0.97 | 0.97 | 0.97 | 65852 | 89 | 78 | 2451 |
| | E13 | 0.40 | 0.50 | 0.44 | 68463 | 3 | 2 | 2 |
| | E14 | 0.96 | 0.97 | 0.96 | 65521 | 116 | 97 | 2736 |
| Influenza | J11 | 0.88 | 0.86 | 0.87 | 68431 | 4 | 5 | 30 |
| Pneumonia | J12 | 1.00 | 0.93 | 0.97 | 68455 | 0 | 1 | 14 |
| | J13 | 0.79 | 0.55 | 0.65 | 68447 | 3 | 9 | 11 |
| | J15 | 0.92 | 0.35 | 0.51 | 68331 | 4 | 88 | 47 |
| | J18 | 0.97 | 0.97 | 0.97 | 59480 | 244 | 286 | 8460 |
| Macro-average | | 0.85 | 0.78 | 0.80 | | | | |
| Micro-average | | 0.96 | 0.96 | 0.96 | | | | |
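The per-class measures and the two averages in this table can be recomputed from the confusion matrix counts. The Python sketch below does so using the true positive, false positive and false negative counts listed for each ICD-10 class. The names `COUNTS` and `prf` are hypothetical, and the F-measure is assumed to be the balanced harmonic mean of precision and recall.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) per ICD-10 class.
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "E10": (520, 162, 14),
    "E11": (2451, 89, 78),
    "E13": (2, 3, 2),
    "E14": (2736, 116, 97),
    "J11": (30, 4, 5),
    "J12": (14, 0, 1),
    "J13": (11, 3, 9),
    "J15": (47, 4, 88),
    "J18": (8460, 244, 286),
}


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and balanced F-measure from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


per_class = {code: prf(*counts) for code, counts in COUNTS.items()}

# Macro-average: unweighted mean of the per-class measures.
macro = tuple(
    sum(measures[i] for measures in per_class.values()) / len(per_class)
    for i in range(3)
)

# Micro-average: pool the counts over all classes, then compute the measures once.
pooled = tuple(sum(counts[i] for counts in COUNTS.values()) for i in range(3))
micro = prf(*pooled)

if __name__ == "__main__":
    for code, measures in per_class.items():
        print(code, [round(m, 2) for m in measures])
    print("macro", [round(m, 2) for m in macro])
    print("micro", [round(m, 2) for m in micro])
```

The gap between the two averages comes from class size: small classes with weak recall, such as E13 and J15, pull the macro-average down, while the micro-average is dominated by the large E11, E14 and J18 classes.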
Breakdown of classification errors according to different error categories
| Category | Total #errors | % of total records | #coroner cases |
|---|---|---|---|
| Classification errors: | 405 | 58.0 | 76 |
| Word variations | 75 | 11.5 | 12 |
| Word combinations | 98 | 13.0 | 15 |
| Secondary causes | 150 | 22.4 | 43 |
| Class confusion | 82 | 11.1 | 6 |
| Ground truth issues: | 290 | 42.0 | 130 |
| Ground truth class confusion | 98 | 12.7 | 10 |
| Ground truth error | 167 | 25.5 | 113 |
| Ground truth empty | 25 | 3.8 | 7 |
Categories are divided into actual classification errors and other issues related to the use of ICD-10 codes as the ground truth label
Common word variants identified during the manual analysis of classification errors
| Pneumonia | Influenza | Diabetes | HIV |
|---|---|---|---|
| Bronchopneumonia | influenzal | Non insulin | Acquired immunodeficiency syndrome |
| Pneumonitis | Type A | Non-insulin | Immune deficiency syndrome |
| Pneumonic | Type B | Diabetic | Human immunosuppressive virus |
| Broncho-pneumonia | Parainfluenza | DM | Human immuno deficiency virus |
| Bronchopneumonitis | Haemophilus | IDD | |
| Pneumocystis | Haemophyllis influenzae | IDDI | |
| | Influenza A | | |
| | Influenza B | | |
| | Parainfluenza III | | |
| | High influenza | | |
| | Influenza-like | | |