Guhan Ram Venkataraman, Arturo Lopez Pineda, Oliver J Bear Don't Walk IV, Ashley M Zehnder, Sandeep Ayyar, Rodney L Page, Carlos D Bustamante, Manuel A Rivas.
Abstract
Unstructured clinical narratives are continuously being recorded as part of delivery of care in electronic health records, and dedicated tagging staff spend considerable effort manually assigning clinical codes for billing purposes. Despite these efforts, however, label availability and accuracy are both suboptimal. In this retrospective study, we aimed to automate the assignment of top-level International Classification of Diseases version 9 (ICD-9) codes to clinical records from human and veterinary data stores using minimal manual labor and feature curation. Automating top-level annotations could in turn enable rapid cohort identification, especially in a veterinary setting. To this end, we trained long short-term memory (LSTM) recurrent neural networks (RNNs) on 52,722 human and 89,591 veterinary records. We investigated the accuracy of both separate-domain and combined-domain models and probed model portability. We established relevant baseline classification performances by training Decision Trees (DT) and Random Forests (RF). We also investigated whether transforming the data using MetaMap Lite, a clinical natural language processing tool, affected classification performance. We showed that the LSTM-RNNs accurately classify veterinary and human text narratives into top-level categories with an average weighted macro F1 score of 0.74 and 0.68 respectively. In the "neoplasia" category, the model trained on veterinary data had a high validation accuracy in veterinary data and moderate accuracy in human data, with F1 scores of 0.91 and 0.70 respectively. Our LSTM method scored slightly higher than that of the DT and RF models. The use of LSTM-RNN models represents a scalable structure that could prove useful in cohort identification for comparative oncology studies. Digitization of human and veterinary health information will continue to be a reality, particularly in the form of unstructured narratives. 
Our approach is a step forward for these two domains to learn from and inform one another.
Year: 2020 PMID: 32569327 PMCID: PMC7307763 DOI: 10.1371/journal.pone.0234647
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Table 1. Top-level coding mapping between ICD-9, ICD-10, and SNOMED-CT.
| Top-level category | Description | ICD-9 | ICD-10 | SNOMED-CT |
|---|---|---|---|---|
| 1 | Infectious and parasitic diseases | 001-139 | A00-B99 | 105714009, 68843000, 78885002, 344431000009103, 338591000009108, 40733004, 17322007 |
| 2 | Neoplasms | 140-239 | C00-D49 | 723976005, 399981008 |
| 3 | Endocrine, nutritional and metabolic diseases, and immunity disorders | 240-279 | E00-E90 | 85828009, 414029004, 473010000, 75934005, 363246002, 2492009, 414916001, 363247006, 420134006, 362969004 |
| 4 | Diseases of blood and blood-forming organs | 280-289 | D50-D89 | 271737000, 414022008, 414026006, 362970003, 11888009, 212373009, 262938004, 405538007 |
| 5 | Mental disorders | 290-319 | F00-F99 | 74732009 |
| 6 | Diseases of the nervous system | 320-359 | G00-G99 | 118940003, 313891000009106 |
| 7 | Diseases of sense organs | 360-389 | H00-H59, H60-H95 | 50611000119105, 87118001, 362966006, 128127008, 85972008 |
| 8 | Diseases of the circulatory system | 390-459 | I00-I99 | 49601007 |
| 9 | Diseases of the respiratory system | 460-519 | J00-J99 | 50043002 |
| 10 | Diseases of the digestive system | 520-579 | K00-K93 | 370514003, 422400008, 53619000 |
| 11 | Diseases of the genitourinary system | 580-629 | N00-N99 | 42030000 |
| 12 | Complications of pregnancy, childbirth, and the puerperium | 630-679 | O00-O99 | 362972006, 173300003, 362973001 |
| 13 | Diseases of the skin and subcutaneous tissue | 680-709 | L00-L99 | 404177007, 414032001, 128598002 |
| 14 | Diseases of the musculoskeletal system and connective tissue | 710-739 | M00-M99 | 105969002, 928000 |
| 15 | Congenital anomalies | 740-759 | Q00-Q99 | 111941005, 32895009, 66091009 |
| 16 | Certain conditions originating in the perinatal period | 760-779 | P00-P96 | 414025005 |
| 17 | Injury and poisoning | 800-999 | S00-T98 | 85983004, 75478009, 77434001, 417163006 |
Mapping of top-level categories was manually curated by two board-certified veterinarians trained in clinical coding.
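The ICD-9 column of the mapping can be applied programmatically to assign a top-level category to a billing code. The sketch below is illustrative only and is not the authors' code: the function name `icd9_top_level` and the list `ICD9_RANGES` are hypothetical, codes are assumed to be zero-padded to a three-digit category prefix (for example `038.9` or `0389`), and the injury range is entered as the full ICD-9 chapter, 800-999. Codes outside the listed ranges, including V and E supplementary codes, are returned as `None`.

```python
from typing import Optional

# (low, high, category) transcribed from the ICD-9 column of the mapping table.
# Hypothetical helper data, not taken from the authors' code.
ICD9_RANGES = [
    (1, 139, 1),     # Infectious and parasitic diseases
    (140, 239, 2),   # Neoplasms
    (240, 279, 3),   # Endocrine, nutritional, metabolic, immunity
    (280, 289, 4),   # Blood and blood-forming organs
    (290, 319, 5),   # Mental disorders
    (320, 359, 6),   # Nervous system
    (360, 389, 7),   # Sense organs
    (390, 459, 8),   # Circulatory system
    (460, 519, 9),   # Respiratory system
    (520, 579, 10),  # Digestive system
    (580, 629, 11),  # Genitourinary system
    (630, 679, 12),  # Pregnancy, childbirth, puerperium
    (680, 709, 13),  # Skin and subcutaneous tissue
    (710, 739, 14),  # Musculoskeletal system and connective tissue
    (740, 759, 15),  # Congenital anomalies
    (760, 779, 16),  # Perinatal conditions
    (800, 999, 17),  # Injury and poisoning (the ICD-9 chapter runs through 999)
]


def icd9_top_level(code: str) -> Optional[int]:
    """Return the top-level category number for an ICD-9 code, or None."""
    prefix = code.strip()[:3]
    if len(prefix) < 3 or not prefix.isdigit():
        return None  # V/E supplementary codes or malformed input
    number = int(prefix)
    for low, high, category in ICD9_RANGES:
        if low <= number <= high:
            return category
    return None  # e.g. 780-799, which the table does not map


if __name__ == "__main__":
    for example in ["174.9", "0389", "V10.3", "780.6"]:
        print(example, icd9_top_level(example))
```

Because a record can carry several codes, applying this function to every code on a record and collecting the distinct results yields the record's set of top-level labels.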
Fig 1. Diagram of the training and evaluation design.
Relevant acronyms: MIMIC: Medical Information Mart for Intensive Care; CSU: Colorado State University; MetaMap: a tool for recognizing medical concepts in text; LSTM: long short-term memory recurrent neural network classifier; RF: Random Forest classifier; DT: Decision Tree classifier.
Table 2. Database statistics of patients, records, and species (records with diagnosis).
| | CSU | MIMIC |
|---|---|---|
| Medical Records | 89,591 | 52,722 |
| Patients | 33,124 | 41,126 |
| Hospital Visits | 89,591 | 49,785 |
| Humans (Homo sapiens) | n.a. | 52,722 |
| Dogs (Canis lupus) | 72,420 | n.a. |
| Cats (Felis silvestris) | 10,205 | n.a. |
| Horses (Equus caballus) | 5,819 | n.a. |
| Other mammals | 1,147 | n.a. |
| Infectious | 11,454 | 10,074 |
| Neoplasia | 36,108 | 6,223 |
| Endo-Immune | 17,295 | 24,762 |
| Blood | 10,171 | 13,481 |
| Mental | 511 | 10,989 |
| Nervous | 7,488 | 9,168 |
| Sense organs | 15,085 | 2,688 |
| Circulatory | 8,733 | 30,054 |
| Respiratory | 11,322 | 17,667 |
| Digestive | 22,776 | 14,646 |
| Genitourinary | 8,892 | 14,932 |
| Pregnancy | 136 | 133 |
| Skin | 21,147 | 4,241 |
| Musculoskeletal | 22,921 | 6,739 |
| Congenital | 3,347 | 2,334 |
| Perinatal | 54 | 3,661 |
| Injury | 9,873 | 16,121 |
The mappings in Table 1 were used to generate the categories and numbers presented here in Table 2. The seventeen categories represent the text classification labels.
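The category counts sum to more than the number of records in each database, which suggests that a record can carry several top-level labels. The short sketch below makes that arithmetic explicit, assuming each category row counts the records that carry that label; the variable and function names are illustrative and not from the source.

```python
# Record totals and per-category counts copied from the database statistics table.
RECORDS = {"CSU": 89_591, "MIMIC": 52_722}

CATEGORY_COUNTS = {
    # category: (CSU, MIMIC)
    "Infectious": (11_454, 10_074),
    "Neoplasia": (36_108, 6_223),
    "Endo-Immune": (17_295, 24_762),
    "Blood": (10_171, 13_481),
    "Mental": (511, 10_989),
    "Nervous": (7_488, 9_168),
    "Sense organs": (15_085, 2_688),
    "Circulatory": (8_733, 30_054),
    "Respiratory": (11_322, 17_667),
    "Digestive": (22_776, 14_646),
    "Genitourinary": (8_892, 14_932),
    "Pregnancy": (136, 133),
    "Skin": (21_147, 4_241),
    "Musculoskeletal": (22_921, 6_739),
    "Congenital": (3_347, 2_334),
    "Perinatal": (54, 3_661),
    "Injury": (9_873, 16_121),
}


def prevalence(category: str, database: str) -> float:
    """Fraction of records in `database` that carry `category`."""
    column = 0 if database == "CSU" else 1
    return CATEGORY_COUNTS[category][column] / RECORDS[database]


def labels_per_record(database: str) -> float:
    """Mean number of top-level labels per record in `database`."""
    column = 0 if database == "CSU" else 1
    total = sum(counts[column] for counts in CATEGORY_COUNTS.values())
    return total / RECORDS[database]


if __name__ == "__main__":
    for db in RECORDS:
        print(db, round(prevalence("Neoplasia", db), 3), round(labels_per_record(db), 2))
```

Under this reading, neoplasia is far more prevalent in the veterinary records than in the human records, and the task is multi-label in both databases.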
Table 3. Average F1 scores using various training and validation dataset combinations for all categories.
| Training | Validation | MetaMap | DT (weighted F1) | RF (weighted F1) | LSTM (weighted F1) |
|---|---|---|---|---|---|
| MIMIC | MIMIC | No | 0.60 | 0.64 | |
| MIMIC | MIMIC | Yes | 0.60 | 0.63 | |
| CSU | CSU | No | 0.55 | 0.61 | |
| CSU | CSU | Yes | 0.54 | 0.60 | |
| MIMIC | CSU | No | 0.22 | 0.24 | |
| MIMIC | CSU | Yes | 0.23 | 0.20 | |
| CSU | MIMIC | No | 0.20 | 0.23 | |
| CSU | MIMIC | Yes | 0.28 | 0.19 | |
| MIMIC + CSU | CSU | No | 0.57 | 0.62 | |
| MIMIC + CSU | CSU | Yes | 0.57 | 0.62 | |
| MIMIC + CSU | MIMIC | No | 0.60 | 0.58 | |
| MIMIC + CSU | MIMIC | Yes | 0.60 | 0.60 | |
| MIMIC + CSU | MIMIC + CSU | No | 0.59 | 0.64 | |
| MIMIC + CSU | MIMIC + CSU | Yes | 0.59 | 0.63 | |
| | | | 0.489 | 0.506 | |
Evaluation metrics for Decision Tree (DT), Random Forest (RF), and the FasTag Long Short Term Memory (LSTM) Recurrent Neural Network on validation datasets with and without MetaMap term extraction. Bolded and underlined numbers represent the best scores for the specific configuration of training data, validation data, and MetaMap toggle.
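For readers unfamiliar with the metric, a weighted F1 score is commonly computed as the per-label F1 averaged with weights proportional to each label's support (its number of true occurrences). The sketch below follows that common definition for multi-label data in plain Python; it is an assumption that the paper's scores were computed this way, and the function names and toy records are illustrative only.

```python
from typing import List, Sequence, Set


def label_f1(y_true: List[Set[str]], y_pred: List[Set[str]], label: str) -> float:
    """F1 for one label, treating it as a binary present/absent decision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def weighted_f1(y_true: List[Set[str]], y_pred: List[Set[str]],
                labels: Sequence[str]) -> float:
    """Support-weighted mean of the per-label F1 scores."""
    support = {lab: sum(1 for t in y_true if lab in t) for lab in labels}
    total = sum(support.values())
    if total == 0:
        return 0.0
    return sum(support[lab] * label_f1(y_true, y_pred, lab) for lab in labels) / total


# Toy example: four records, three of the seventeen top-level labels.
truth = [{"Neoplasia"}, {"Neoplasia", "Skin"}, {"Skin"}, {"Digestive"}]
predicted = [{"Neoplasia"}, {"Neoplasia"}, {"Skin", "Neoplasia"}, set()]
score = weighted_f1(truth, predicted, ["Neoplasia", "Skin", "Digestive"])

if __name__ == "__main__":
    print(round(score, 3))
```

Weighting by support means that rare labels such as "Pregnancy" or "Perinatal" contribute little to the overall score, whereas an unweighted macro average would give every label equal influence.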
Table 4. F1 scores using various training and validation dataset combinations for the “neoplasia” category.
| Training | Validation | MetaMap | DT (weighted F1) | RF (weighted F1) | LSTM (weighted F1) |
|---|---|---|---|---|---|
| MIMIC | MIMIC | No | 0.39 | 0.45 | |
| MIMIC | MIMIC | Yes | 0.4 | 0.45 | |
| CSU | CSU | No | 0.81 | 0.86 | |
| CSU | CSU | Yes | 0.8 | 0.86 | |
| MIMIC | CSU | No | 0.3 | 0.53 | |
| MIMIC | CSU | Yes | 0.45 | 0.37 | |
| CSU | MIMIC | No | 0.46 | 0.58 | |
| CSU | MIMIC | Yes | 0.5 | 0.54 | |
| MIMIC + CSU | CSU | No | 0.74 | 0.8 | |
| MIMIC + CSU | CSU | Yes | 0.74 | 0.8 | |
| MIMIC + CSU | MIMIC | No | 0.4 | 0.47 | |
| MIMIC + CSU | MIMIC | Yes | 0.42 | 0.45 | |
| MIMIC + CSU | MIMIC + CSU | No | 0.81 | 0.85 | |
| MIMIC + CSU | MIMIC + CSU | Yes | 0.81 | 0.86 | |
| | | | 0.574 | 0.637 | |
Evaluation metrics for the “neoplasia” category Decision Tree (DT), Random Forest (RF), and the FasTag Long Short Term Memory (LSTM) Recurrent Neural Network on validation datasets with and without MetaMap term extraction. Bolded and underlined numbers represent the best scores for the specific configuration of training data, validation data, and MetaMap toggle.
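The unlabeled final row of the “neoplasia” table (0.574 and 0.637) appears to be the mean of the DT and RF columns over the fourteen configurations. That reading is an interpretation and is not stated in the source; the check below recomputes the column means from the two-decimal values shown in the table, so small differences from the reported figures are expected from rounding.

```python
from statistics import mean

# DT and RF scores for the "neoplasia" category, in table row order.
dt_scores = [0.39, 0.40, 0.81, 0.80, 0.30, 0.45, 0.46,
             0.50, 0.74, 0.74, 0.40, 0.42, 0.81, 0.81]
rf_scores = [0.45, 0.45, 0.86, 0.86, 0.53, 0.37, 0.58,
             0.54, 0.80, 0.80, 0.47, 0.45, 0.85, 0.86]

dt_mean = mean(dt_scores)
rf_mean = mean(rf_scores)

if __name__ == "__main__":
    # Compare against the reported 0.574 (DT) and 0.637 (RF).
    print(f"DT mean: {dt_mean:.3f}")
    print(f"RF mean: {rf_mean:.3f}")
```

The recomputed DT mean agrees with the reported value to three decimals, and the RF mean falls within the error that two-decimal rounding of fourteen inputs can introduce.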