Chin Lin1,2, Yu-Sheng Lou1,2, Dung-Jang Tsai1,2, Chia-Cheng Lee3, Chia-Jung Hsu3, Ding-Chung Wu4, Mei-Chuen Wang4, Wen-Hui Fang5.
Abstract
BACKGROUND: Most current state-of-the-art models for searching the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes use word embedding technology to capture useful semantic properties. However, they are limited by the quality of the initial word embeddings. Word embeddings trained on electronic health records (EHRs) are considered the best, but their vocabulary diversity is limited to the terms found in previous medical records. Thus, we require a word embedding model that maintains the vocabulary diversity of open internet databases and the medical terminology understanding of EHRs. Moreover, we need to consider the particularity of disease classification, wherein discharge notes present only positive disease descriptions.
Keywords: artificial intelligence; convolutional neural network; electronic health records; natural language processing; word embedding
Year: 2019 PMID: 31339103 PMCID: PMC6683650 DOI: 10.2196/14499
Source DB: PubMed Journal: JMIR Med Inform
Figure 1. Concept of the projection word embedding model.
Table 1. Prevalence of different one-character-level International Classification of Diseases, Tenth Revision, Clinical Modification codes used in discharge notes in this study.
| ICD-10-CMa code | Definition | Training setb (n=82,390), n (%) | Validation setc (n=12,145), n (%) | Testing set 1d (n=24,780), n (%) | Testing set 2e (n=74,332), n (%) |
| A00-B99 | Certain infectious and parasitic diseases | 14,883 (18.1) | 2296 (18.9) | 4713 (19) | 14,704 (19.8) |
| C00-D49 | Neoplasms | 29,125 (35.4) | 4405 (36.3) | 8721 (35.2) | 7220 (9.7) |
| D50-D89 | Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism | 8707 (10.6) | 1062 (8.7) | 2258 (9.1) | 7112 (9.6) |
| E00-E89 | Endocrine, nutritional, and metabolic diseases | 22,884 (27.8) | 3404 (28) | 6915 (27.9) | 21,866 (29.4) |
| F01-F99 | Mental, behavioral, and neurodevelopmental disorders | 7410 (9) | 1084 (8.9) | 2237 (9) | 9956 (13.4) |
| G00-G99 | Diseases of the nervous system | 7200 (8.7) | 987 (8.1) | 2270 (9.2) | 5332 (7.2) |
| H00-H59 | Diseases of the eye and adnexa | 3039 (3.7) | 430 (3.5) | 865 (3.5) | 873 (1.2) |
| H60-H95 | Diseases of the ear and mastoid process | 1044 (1.3) | 174 (1.4) | 312 (1.3) | 846 (1.1) |
| I00-I99 | Diseases of the circulatory system | 29,152 (35.4) | 4129 (34) | 8857 (35.7) | 28,509 (38.4) |
| J00-J99 | Diseases of the respiratory system | 15,455 (18.8) | 2068 (17) | 4602 (18.6) | 22,344 (30.1) |
| K00-K95 | Diseases of the digestive system | 20,621 (25) | 2969 (24.4) | 5956 (24) | 22,500 (30.3) |
| L00-L99 | Diseases of the skin and subcutaneous tissue | 4217 (5.1) | 702 (5.8) | 1347 (5.4) | 5297 (7.1) |
| M00-M99 | Diseases of the musculoskeletal system and connective tissue | 12,030 (14.6) | 1697 (14) | 3525 (14.2) | 10,801 (14.5) |
| N00-N99 | Diseases of the genitourinary system | 19,454 (23.6) | 2782 (22.9) | 5934 (23.9) | 18,345 (24.7) |
| O00-O9A | Pregnancy, childbirth, and the puerperium | 2195 (2.7) | 311 (2.6) | 632 (2.6) | 1409 (1.9) |
| P00-P96 | Certain conditions originating in the perinatal period | 840 (1) | 106 (0.9) | 179 (0.7) | 375 (0.5) |
| Q00-Q99 | Congenital malformations, deformations, and chromosomal abnormalities | 1104 (1.3) | 152 (1.3) | 286 (1.2) | 444 (0.6) |
| R00-R99 | Symptoms, signs, and abnormal clinical and laboratory findings, not elsewhere classified | 11,029 (13.4) | 1636 (13.5) | 3335 (13.5) | 13,027 (17.5) |
| S00-T88 | Injury, poisoning, and certain other consequences of external causes | 9949 (12.1) | 1539 (12.7) | 3239 (13.1) | 14,244 (19.2) |
| V00-Y99 | External causes of morbidity | 114 (0.1) | 4 (<0.1) | 4 (<0.1) | 12,548 (16.9) |
| Z00-Z99 | Factors influencing health status and contact with health services | 24,819 (30.1) | 4107 (33.8) | 8353 (33.7) | 15,346 (20.6) |
aICD-10-CM: International Classification of Diseases, Tenth Revision, Clinical Modification.
bTraining set includes samples collected between June 1, 2015, and March 22, 2017, from the Tri-Service General Hospital.
cValidation set includes samples collected between March 23, 2017, and June 30, 2017, from the Tri-Service General Hospital.
dTesting set 1 includes samples between July 1, 2017, and December 31, 2017, from the Tri-Service General Hospital.
eTesting set 2 includes samples from the Taichung Armed Forces General Hospital, Taoyuan Armed Forces General Hospital, Taichung Armed Forces General Hospital Zhongqing Branch, Hualien Armed Forces General Hospital, Tri-Service General Hospital Penghu Branch, Tri-Service General Hospital SongShan Branch, and Zuoying Branch of Kaohsiung Armed Forces General Hospital.
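The date boundaries in footnotes b to d describe a chronological split of the Tri-Service General Hospital samples. A minimal sketch of such a split is shown here; the function name `assign_split`, the use of a single per-note date, and the example records are assumptions for illustration and are not taken from the study's code or data.

```python
from datetime import date

# Date boundaries copied from the table footnotes (Tri-Service General Hospital).
TRAIN_START, TRAIN_END = date(2015, 6, 1), date(2017, 3, 22)
VALID_START, VALID_END = date(2017, 3, 23), date(2017, 6, 30)
TEST1_START, TEST1_END = date(2017, 7, 1), date(2017, 12, 31)


def assign_split(note_date: date) -> str:
    """Map a note's date to the dataset whose collection window contains it."""
    if TRAIN_START <= note_date <= TRAIN_END:
        return "training"
    if VALID_START <= note_date <= VALID_END:
        return "validation"
    if TEST1_START <= note_date <= TEST1_END:
        return "testing_1"
    return "out_of_range"


# Hypothetical records (note_id, date); these are not real study data.
notes = [
    ("n1", date(2016, 1, 15)),
    ("n2", date(2017, 3, 23)),
    ("n3", date(2017, 12, 31)),
]
splits = {note_id: assign_split(d) for note_id, d in notes}
print(splits)
```

Testing set 2 is defined by hospital rather than by date, so it would be selected by site and is not covered by this sketch.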
Figure 2. Model architectures in our experiments. ICD: International Classification of Diseases.
Figure 3. Hybrid sampling method.
Table 2. Pearson correlation coefficients between similarity scores of disease coding performed by human judgment and those calculated using different word embeddings.
| Series and dataset | Original Wikipedia | Original PubMed | Original EHRa | Original EHR+Wikipedia | Original EHR+PubMed | Projection Wikipedia | Projection PubMed |
| Hliaoutakis’ | 0.2820 | 0.4968 | 0.4815 | 0.3488 | 0.4914 | 0.3202 | 0.5255 |
| MayoSRS | 0.0082 | 0.5087 | 0.6082 | 0.1948 | 0.6028 | 0.0930 | 0.5148 |
| MiniMayoSRS | 0.3363 | 0.7200 | 0.6613 | 0.4746 | 0.7201 | 0.4709 | 0.5903 |
| UMNSRS Relatedness | 0.2836 | 0.4891 | 0.4525 | 0.3808 | 0.4774 | 0.3378 | 0.4390 |
| UMNSRS Relatedness - MODe | 0.2985 | 0.5094 | 0.5020 | 0.4015 | 0.5184 | 0.3678 | 0.4903 |
| UMNSRS Similarity | 0.3032 | 0.4916 | 0.4617 | 0.3906 | 0.4868 | 0.3281 | 0.4071 |
| UMNSRS Similarity - MOD | 0.3379 | 0.5271 | 0.4993 | 0.4304 | 0.5272 | 0.3733 | 0.4771 |
aEHR: electronic health record.
bMeSH: Medical Subject Headings.
cMayoSRS: Mayo Medical Coders Set.
dUMNSRS: University of Minnesota Semantic Relatedness Set.
eMOD: modification.
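Each coefficient above compares human similarity ratings for word pairs with a similarity score computed from an embedding. A minimal sketch of that comparison follows, assuming cosine similarity as the embedding-based score and assuming that pairs with an out-of-vocabulary word are skipped; neither choice is stated in this excerpt. The vectors and ratings are toy values, not benchmark data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy 3-dimensional embeddings (hypothetical values).
embeddings = {
    "sepsis": [0.9, 0.1, 0.0],
    "septicemia": [0.8, 0.2, 0.1],
    "diabetes": [0.1, 0.9, 0.2],
    "hypertension": [0.2, 0.8, 0.3],
}

# Toy benchmark: (word 1, word 2, human rating). "mods" is out of vocabulary.
benchmark = [
    ("sepsis", "septicemia", 3.8),
    ("diabetes", "hypertension", 2.5),
    ("sepsis", "diabetes", 1.0),
    ("sepsis", "mods", 2.0),
]

human, model = [], []
for w1, w2, rating in benchmark:
    if w1 in embeddings and w2 in embeddings:  # skip out-of-vocabulary pairs
        human.append(rating)
        model.append(cosine(embeddings[w1], embeddings[w2]))

r = pearson(human, model)
print(f"pairs used: {len(human)}, Pearson r = {r:.4f}")
```

How out-of-vocabulary pairs are handled affects the coefficient, which matters when embeddings with different vocabularies are compared on the same benchmark.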
Table 3. Selected words and the corresponding five most similar words obtained from different word embedding models.
| Target word | Original Wikipedia | Original PubMed | Original EHRa | Original EHR+Wikipedia | Original EHR+PubMed | Projection Wikipedia | Projection PubMed |
| Neoplasm | Malignant | Leiomyosarcoma | Neoplasms | Neoplasms | Neoplasms | Polyp | Angiosarcoma |
| | Polyp | Angiosarcoma | Carcinoid | Mucinous | Carcinoid | Mucinous | Leiomyosarcoma |
| | Neoplasms | Malignancy | Lymphoepithelial | Malignant | Mucinous | Malignant | Lipoma |
| | Nematode | Malignant | Oncocytoma | Pheochromocytoma | Paraganglioma | Nematode | Acinic |
| | Mucinous | Neoplasms | Mucinous | Carcinoid | Oncocytoma | Cyst | Malignancy |
| Hypertension | Diabetes | Hypertensive | Hyperlipidemia | Diabetes | Hypertensive | Diabetes | Hypertensive |
| | Pulmonary | Renovascular | Dyslipidemia | Cardiovascular | Hyperlipidemia | Pulmonary | Dyslipidemia |
| | Cardiovascular | Cardiovascular | Hypertensive | Chronic | Dyslipidemia | Chronic | Mellitus |
| | Asthma | Normotension | HCVD | Pulmonary | Cardiovascular | Disease | Hyperlipidemia |
| | Chronic | Dyslipidemia | Hyperuricemia | Asthma | Hypercholesterolemia | Acute | Dyslipidemia |
| Diabetes | Hypertension | Mellitus | Mellitus | Hypertension | Mellitus | Hypertension | Mellitus |
| | Cancer | Diabetic | DM | Cardiovascular | Diabetics | Disease | Diabetics |
| | Asthma | Diabetics | Diabetics | Diabetics | Diabetic | Patients | Diabetic |
| | Obesity | Dyslipidemia | Diabetes | Mellitus | NIDDM | Hepatitis | IGT |
| | Alzheimer | Hyperlipidemia | Cardiovascular | Diabetic | Macrovascular | Treating | Nondiabetic |
| Pneumonia | Respiratory | Pneumonias | Acquired | Respiratory | Pneumonias | Illness | Pneumonias |
| | Illness | Bronchopneumonia | Community | Infection | Bacteremic | Respiratory | Bronchopneumonia |
| | Complications | Bacteremia | Healthcare | Hospitalized | Bacteremia | Infection | Bacteremia |
| | Bronchitis | Bacteremic | Aspiration | Infections | Acquired | SARS | Nosocomial |
| | Infection | Meningitis | Pneumonia | Illness | Bronchopneumonia | Hepatitis | Meningitis |
| Sepsis | Meningitis | Septic | Septic | Septicemia | Septic | Hepatitis | Septic |
| | Septicemia | Septicemia | Septicemia | Bacteremia | Bacteremia | Respiratory | Septicemia |
| | Jaundice | Peritonitis | Coli | Infection | Septicemia | Infection | Bacteremia |
| | Hepatitis | Polymicrobial | Bacteremia | Septicemia | Polymicrobial | Illness | Meningitis |
| | Diabetes | Mods | Epiglottitis | Meningitis | Septicemia | Jaundice | Polymicrobial |
aEHR: electronic health record.
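Lists of most similar words like those above are typically obtained by ranking the vocabulary by vector similarity to the target word. The sketch below assumes cosine similarity as the ranking score, which this excerpt does not state; the helper `most_similar` and the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(target, embeddings, k=5):
    """Return the k vocabulary words closest to `target`, excluding itself."""
    target_vec = embeddings[target]
    scored = [
        (word, cosine(target_vec, vec))
        for word, vec in embeddings.items()
        if word != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:k]]


# Toy vocabulary (hypothetical vectors, not taken from any trained model).
toy = {
    "sepsis": [1.0, 0.0],
    "septic": [0.9, 0.1],
    "septicemia": [0.8, 0.3],
    "jaundice": [0.1, 1.0],
}
print(most_similar("sepsis", toy, k=2))
```

Real vocabularies hold many thousands of words, so libraries usually vectorize this ranking instead of looping in pure Python.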
Table 4. Results of the three-character-level ICD-10-CM coding task using different word embeddings (italicized font indicates the best precision, recall, and F-measure).
| Situations | Testing set 1a, Precision | Testing set 1a, Recall | Testing set 1a, F-measure | Testing set 2b, Precision | Testing set 2b, Recall | Testing set 2b, F-measure |
| a: EHRc | 0.7156 | 0.7724 | 0.7250 | 0.6852 | 0.6932 | 0.6574 |
| b: Wikipedia | 0.7106 | 0.7689 | 0.7213 | 0.6879 | 0.6743 | 0.6479 |
| c: PubMed | 0.6723 | 0.7725 | 0.6974 | 0.6491 | 0.6776 | 0.6260 |
| d: EHR+Wikipedia | 0.7066 | 0.7665 | 0.7208 | 0.6854 | 0.6797 | 0.6540 |
| e: Projection Wikipedia | 0.7177 | 0.7776 | 0.7316 | 0.6877 | 0.6929 | 0.6617 |
| f: Projection PubMed | 0.7070 | 0.7700 | 0.7187 | 0.6817 | 0.6908 | 0.6561 |
| g: Projection Wikipedia+Projection PubMed | 0.7809 | 0.7362 | 0.6994 | 0.6693 | ||
| h: Projection Wikipedia+Projection PubMed+Hybrid sampling | 0.7189 | 0.6826 | ||||
aTesting set 1 includes the samples collected between July 1, 2017, and December 31, 2017, from the Tri-Service General Hospital.
bTesting set 2 includes the samples from the Taichung Armed Forces General Hospital, Taoyuan Armed Forces General Hospital, Taichung Armed Forces General Hospital Zhongqing Branch, Hualien Armed Forces General Hospital, Tri-Service General Hospital Penghu Branch, Tri-Service General Hospital SongShan Branch, and Zuoying Branch of Kaohsiung Armed Forces General Hospital.
cEHR: electronic health record.
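Because a discharge note can carry several codes, precision, recall, and F-measure have to be aggregated over notes or over codes. The reported F-measures are not the harmonic mean of the reported precision and recall (for situation a on testing set 1, 0.7156 and 0.7724 would give about 0.7429, whereas 0.7250 is reported), which suggests that scores were averaged after being computed per item. The sketch below illustrates one such scheme, per-note scoring followed by averaging; this scheme is an assumption, since the excerpt does not state how the averaging was done, and the code sets are made-up examples.

```python
def prf(predicted, expected):
    """Precision, recall, and F-measure for one note's set of codes."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def mean_prf(pairs):
    """Average per-note precision, recall, and F-measure over all notes."""
    scores = [prf(predicted, expected) for predicted, expected in pairs]
    n = len(scores)
    return tuple(sum(score[i] for score in scores) / n for i in range(3))


# Hypothetical (predicted, expected) three-character code sets for two notes.
pairs = [
    ({"O34", "Z37"}, {"O34", "Z37", "Z3A"}),
    ({"K85", "K75"}, {"K85"}),
]
mean_p, mean_r, mean_f = mean_prf(pairs)
print(f"precision={mean_p:.4f} recall={mean_r:.4f} F={mean_f:.4f}")
```

In this toy case the averaged F-measure differs from the harmonic mean of the averaged precision and recall, the same kind of gap seen in the table.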
Figure 4. Density plots of predictions of each single word provided by the model with and without hybrid sampling training.
Table 5. ICD-10-CM coding results of selected models in several simulated discharge notes (italicized font indicates inconsistent predictions among the models with and without hybrid sampling training).a
| Example discharge note and expected result | Hybrid sampling training | ||
| Without (%)b | With (%)c | ||
| Z3A (100) | O34 (100) | ||
| O34 | Z37 (99) | Z37 (100) | |
| O34 | O34 (98) | Z3A (100) | |
| O34 | K85 (97) | K83 (99) | |
| K85 | K75 (96) | K85 (99) | |
| K75 | K83 (95) | K75 (99) | |
| Z37 | N/Ad | ||
| Z3A | N/A | ||
| N/A | N/A | ||
| O34 | Z37 (99) | O34 (100) | |
| Z37 | Z3A (99) | Z37 (100) | |
| Z3A | O34 (99) | Z3A (100) | |
| N/A | |||
aList of ICD-10-CM codes used: C17: malignant neoplasm of small intestine; O34: maternal care for abnormality of pelvic organs; O60: preterm labor; K83: other diseases of biliary tract; K85: acute pancreatitis; K75: other inflammatory liver diseases; Z37: outcome of delivery; Z3A: weeks of gestation; K91: intraoperative and postprocedural complications and disorders of digestive system, not elsewhere classified; C53: malignant neoplasm of cervix uteri.
bThe classification model trained by projection Wikipedia and PubMed embeddings (situation g in Table 4).
cThe classification model trained by projection Wikipedia and PubMed embeddings and hybrid sampling method (situation h in Table 4).
dN/A: not applicable.