Chia-Cheng Lee, Sui-Lung Su, Hsiang-Cheng Chen, Chin Lin, Chia-Jung Hsu, Yu-Sheng Lou, Shih-Jen Yeh.
Abstract
BACKGROUND: Automated disease code classification using free-text medical information is important for public health surveillance. However, traditional natural language processing (NLP) pipelines are limited, so we propose a method combining word embedding with a convolutional neural network (CNN).
Keywords: convolutional neural network; data mining; electronic health records; electronic medical records; machine learning; natural language processing; neural networks (computer); text mining; word embedding
Year: 2017 PMID: 29109070 PMCID: PMC5696581 DOI: 10.2196/jmir.8344
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 5.428
Prevalence of different International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) chapter-level codes in discharge notes from the Tri-Service General Hospital, Taipei, Taiwan.
| ICD-10-CM codes | Definition | Before June 30, 2016 (n=64,023) | After July 1, 2016 (n=39,367) | Full study period (n=103,390) |
| A00-B99 | Certain infectious and parasitic diseases | 7731 (12.1%) | 5455 (13.9%) | 13,186 (12.8%) |
| C00-D49 | Neoplasms | 20,585 (32.2%) | 13,993 (35.5%) | 34,578 (33.5%) |
| D50-D89 | Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism | 4516 (7.1%) | 3132 (8.0%) | 7648 (7.4%) |
| E00-E89 | Endocrine, nutritional, and metabolic diseases | 13,223 (20.7%) | 8765 (22.3%) | 21,988 (21.3%) |
| F01-F99 | Mental, behavioral, and neurodevelopmental disorders | 4612 (7.2%) | 2942 (7.5%) | 7554 (7.3%) |
| G00-G99 | Diseases of the nervous system | 3703 (5.8%) | 2602 (6.6%) | 6305 (6.1%) |
| H00-H59 | Diseases of the eye and adnexa | 2337 (3.7%) | 1374 (3.5%) | 3711 (3.6%) |
| H60-H95 | Diseases of the ear and mastoid process | 802 (1.3%) | 470 (1.2%) | 1272 (1.2%) |
| I00-I99 | Diseases of the circulatory system | 17,650 (27.6%) | 11,465 (29.1%) | 29,115 (28.2%) |
| J00-J99 | Diseases of the respiratory system | 7743 (12.1%) | 5584 (14.2%) | 13,327 (13.0%) |
| K00-K95 | Diseases of the digestive system | 12,849 (20.1%) | 8444 (21.4%) | 21,293 (20.6%) |
| L00-L99 | Diseases of the skin and subcutaneous tissue | 2568 (4.0%) | 1711 (4.3%) | 4279 (4.1%) |
| M00-M99 | Diseases of the musculoskeletal system and connective tissue | 9170 (14.3%) | 5152 (13.1%) | 14,322 (13.9%) |
| N00-N99 | Diseases of the genitourinary system | 9929 (15.5%) | 7325 (18.6%) | 17,254 (16.8%) |
| O00-O9A | Pregnancy, childbirth, and the puerperium | 2509 (3.9%) | 1271 (3.2%) | 3780 (3.7%) |
| P00-P96 | Certain conditions originating in the perinatal period | 793 (1.2%) | 493 (1.3%) | 1286 (1.2%) |
| Q00-Q99 | Congenital malformations, deformations, and chromosomal abnormalities | 927 (1.4%) | 513 (1.3%) | 1440 (1.4%) |
| R00-R99 | Symptoms, signs, and abnormal clinical and laboratory findings, not elsewhere classified | 5271 (8.2%) | 3824 (9.7%) | 9095 (8.9%) |
| S00-T88 | Injury, poisoning, and certain other consequences of external causes | 6272 (9.8%) | 4564 (11.6%) | 10,836 (10.6%) |
| V00-Y99 | External causes of morbidity | 791 (1.2%) | 68 (0.2%) | 859 (0.8%) |
| Z00-Z99 | Factors influencing health status and contact with health services | 15,488 (24.2%) | 10,093 (25.6%) | 25,581 (24.8%) |
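Each percentage in the table is a chapter count divided by the number of discharge notes in that study stage. The percentages in a column sum to well over 100%, which suggests that a single note can carry codes from several chapters. The following is a minimal sketch of that arithmetic for three rows copied from the table; the names `NOTES`, `COUNTS`, and `prevalence` are illustrative and do not come from the paper.

```python
# Number of discharge notes per study stage, taken from the table header.
NOTES = {"before": 64023, "after": 39367, "full": 103390}

# Three rows copied from the table: chapter -> note counts per stage.
COUNTS = {
    "A00-B99": {"before": 7731, "after": 5455, "full": 13186},
    "I00-I99": {"before": 17650, "after": 11465, "full": 29115},
    "V00-Y99": {"before": 791, "after": 68, "full": 859},
}


def prevalence(count, total):
    """Percentage of notes carrying a chapter, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Recompute the percentage columns for the selected rows.
table = {
    chapter: {stage: prevalence(n, NOTES[stage]) for stage, n in row.items()}
    for chapter, row in COUNTS.items()
}

for chapter, row in table.items():
    print(chapter, row)
```

The sharp drop for V00-Y99 between the two stages (1.2% to 0.2%) is visible directly in the recomputed values.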
Figure 1. Model architecture with 5 convolution channels and 1 fully connected (FC) layer. ReLU: rectified linear unit.
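The architecture named in the Figure 1 caption can be sketched as a forward pass in plain Python. This is an illustration under stated assumptions, not the authors' implementation. The 50-dimensional embeddings and the 21 chapter-level outputs come from the tables in this record. The filter widths (1 to 5 tokens), the number of filters per channel, the max-over-time pooling, the sigmoid outputs, and the random weights are all assumptions made to keep the example small and runnable.

```python
import math
import random

EMBED_DIM = 50                   # GloVe dimensionality given in the table footnotes
NUM_LABELS = 21                  # chapter-level code groups in the prevalence table
FILTER_WIDTHS = (1, 2, 3, 4, 5)  # assumption: one filter width per convolution channel
FILTERS_PER_CHANNEL = 4          # assumption: kept tiny for illustration


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(seed=0):
    """Random weights (biases omitted) for the 5 channels and the FC layer."""
    rng = random.Random(seed)
    conv = {
        w: [[rng.uniform(-0.1, 0.1) for _ in range(w * EMBED_DIM)]
            for _ in range(FILTERS_PER_CHANNEL)]
        for w in FILTER_WIDTHS
    }
    n_features = len(FILTER_WIDTHS) * FILTERS_PER_CHANNEL
    fc = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
          for _ in range(NUM_LABELS)]
    return conv, fc


def forward(tokens, conv, fc):
    """tokens: list of EMBED_DIM-long lists. Returns NUM_LABELS probabilities."""
    features = []
    for width, filters in conv.items():
        # Flatten every run of `width` consecutive embeddings into one window.
        windows = [sum(tokens[i:i + width], [])
                   for i in range(len(tokens) - width + 1)]
        for f in filters:
            activations = [relu(sum(a * b for a, b in zip(f, win)))
                           for win in windows]
            features.append(max(activations))  # assumed max-over-time pooling
    # Single fully connected layer with one independent sigmoid per chapter.
    return [sigmoid(sum(w * x for w, x in zip(row, features))) for row in fc]


# Usage: a fake 12-token note with random embeddings.
rng = random.Random(1)
note = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(12)]
conv, fc = init_params()
probs = forward(note, conv, fc)
```

One sigmoid per output is used here because the prevalence table suggests a note can belong to several chapters at once; a softmax would force the chapters to compete.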
Global (and lowest 5) means of training and testing AUCs[a] in the 5-fold cross-validation test.
| Pipeline | Training set AUC[b] | Training set F-measure | Testing set AUC[b] | Testing set F-measure |
| NLP[c] + SVM[d] (linear) | 0.9947 (0.9836) | 0.9546 (0.8560) | 0.9571 (0.8891) | 0.8606 (0.6387) |
| NLP + SVM (polynomial) | 0.8627 (0.6736) | 0.5630 (0.2498) | 0.8183 (0.6332) | 0.5050 (0.2023) |
| NLP + SVM (radial basis) | 0.9565 (0.9146) | 0.7984 (0.6613) | 0.9363 (0.8582) | 0.7569 (0.5352) |
| NLP + SVM (sigmoid) | 0.9518 (0.9021) | 0.7852 (0.6368) | 0.9325 (0.8526) | 0.7498 (0.5313) |
| NLP + RF[e] | 0.9999 (0.9995)[f] | 0.9864 (0.9628) | 0.9570 (0.8800) | 0.8739 (0.6475) |
| NLP + GBM[g] | 0.9996 (0.9990) | 0.9868 (0.9660) | 0.9544 (0.8722) | 0.8691 (0.6458) |
| GloVe[h] + CNN[i] | 0.9964 (0.9890) | 0.9837 (0.9588) | 0.9696 (0.9135)[f] | 0.9086 (0.7651) |
[a] AUC: area under the curve, calculated using the receiver operating characteristic curve.
[b] The results are presented as the mean AUC or F-measure (mean of the lowest 5 AUCs or F-measures). Detailed AUCs and F-measures for each chapter-level International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) diagnosis code are shown in Multimedia Appendix 2.
[c] NLP: natural language processing for feature extraction (terms, n-gram phrases, and SNOMED CT categories).
[d] SVM: support vector machine.
[e] RF: random forest.
[f] The best method for a specific index.
[g] GBM: gradient boosting machine.
[h] GloVe: a 50-dimensional word embedding model, pretrained using English Wikipedia and Gigaword.
[i] CNN: convolutional neural network.
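Each cell in the table reports two numbers: the mean of a metric over all chapter-level codes and, in parentheses, the mean over the 5 codes with the lowest values. The sketch below shows one way to compute the per-code metrics and that summary. It assumes that the F-measure is the balanced F1 score and that the AUC is computed by pairwise ranking; the record does not state either detail, and the function names are illustrative.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Requires at least one positive and one negative.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f_measure(labels, predictions):
    """Balanced F1: harmonic mean of precision and recall (assumed variant)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def summarize(per_code_values, k=5):
    """Return (global mean, mean of the k lowest values), as in the table cells."""
    ordered = sorted(per_code_values)
    return sum(ordered) / len(ordered), sum(ordered[:k]) / k


# Toy usage with made-up labels and scores (not data from the paper).
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
f1 = f_measure([1, 1, 0, 0], [1, 0, 1, 0])
global_mean, lowest5_mean = summarize([0.9, 0.8, 0.7, 0.6, 0.5, 1.0])
```

Reporting the lowest-5 mean alongside the global mean exposes how a pipeline behaves on its hardest codes, which a global average alone can hide.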
Global (and lowest 5) means of the training and testing AUCs[a] in the real-world test.
| Pipeline | Training set AUC[b] | Training set F-measure | Testing set AUC[b] | Testing set F-measure |
| NLP[c] + SVM[d] (linear) | 0.9921 (0.9768) | 0.9365 (0.7983) | 0.9477 (0.8549) | 0.8458 (0.5984) |
| NLP + SVM (polynomial) | 0.9103 (0.7975) | 0.6316 (0.4045) | 0.8716 (0.7400) | 0.5761 (0.2802) |
| NLP + SVM (radial basis) | 0.9577 (0.9208) | 0.7954 (0.6484) | 0.9349 (0.8476) | 0.7588 (0.5258) |
| NLP + SVM (sigmoid) | 0.9522 (0.9058) | 0.7840 (0.6261) | 0.9259 (0.8196) | 0.7515 (0.5209) |
| NLP + RF[e] | 0.9996 (0.9985)[f] | 0.9869 (0.9664)[f] | 0.9483 (0.8484) | 0.8582 (0.5901) |
| NLP + GBM[g] | 0.9995 (0.9985) | 0.9821 (0.9562) | 0.9462 (0.8416) | 0.8568 (0.5948) |
| GloVe[h] + CNN[i] | 0.9956 (0.9868) | 0.9803 (0.9523) | 0.9645 (0.8952)[f] | 0.9003 (0.7204)[f] |
[a] AUC: area under the curve, calculated using the receiver operating characteristic curve.
[b] The results are presented as the mean AUC or F-measure (mean of the lowest 5 AUCs or F-measures). Detailed AUCs and F-measures for each chapter-level International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) diagnosis code are shown in Multimedia Appendix 3.
[c] NLP: natural language processing for feature extraction (terms, n-gram phrases, and SNOMED CT categories).
[d] SVM: support vector machine.
[e] RF: random forest.
[f] The best method for a specific index.
[g] GBM: gradient boosting machine.
[h] GloVe: a 50-dimensional word embedding model, pretrained using English Wikipedia and Gigaword.
[i] CNN: convolutional neural network.
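The two result tables correspond to two ways of splitting the notes into training and testing sets. The sketch below contrasts them. It assumes that the real-world test trains on notes dated up to June 30, 2016 and tests on later notes, following the stage boundary in the prevalence table; this record does not state that explicitly. The helper names and the note representation as `(discharge_date, text)` tuples are hypothetical.

```python
import random
from datetime import date

CUTOFF = date(2016, 6, 30)  # last day of the earlier stage in the prevalence table


def temporal_split(notes):
    """notes: list of (discharge_date, text). Split into (train, test) by date."""
    train = [n for n in notes if n[0] <= CUTOFF]
    test = [n for n in notes if n[0] > CUTOFF]
    return train, test


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled, near-equal folds."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx


# Toy usage with made-up notes (not data from the paper).
notes = [
    (date(2016, 3, 1), "note a"),
    (date(2016, 6, 30), "note b"),
    (date(2016, 7, 1), "note c"),
    (date(2016, 9, 15), "note d"),
]
train, test = temporal_split(notes)
folds = list(k_fold_indices(10, k=5))
```

A date-based split is the stricter of the two, because the model never sees notes written after the cutoff; this is consistent with the slightly lower testing scores in the real-world table.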
Figure 2. Visualization of selected convolving filters.
Figure 3. Information gains of the features extracted by the convolving filters in each classification task. AUC: area under the curve; IG: information gain.
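Information gain measures how much knowing a feature reduces the entropy of the label: IG = H(Y) - H(Y | X). The sketch below computes it for a binary feature and a binary label. It assumes that a filter output is first binarized (for example, active versus inactive); the record does not say how the authors discretized the filter features, so the function names and that step are illustrative only.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(feature_on, labels):
    """IG = H(Y) - sum over v of P(X=v) * H(Y | X=v), for binary X and Y."""
    n = len(labels)
    on = [y for x, y in zip(feature_on, labels) if x]
    off = [y for x, y in zip(feature_on, labels) if not x]
    conditional = len(on) / n * entropy(on) + len(off) / n * entropy(off)
    return entropy(labels) - conditional


# Toy usage: one feature that matches the label exactly, one that is unrelated.
labels = [1, 1, 0, 0]
ig_perfect = information_gain([1, 1, 0, 0], labels)
ig_useless = information_gain([1, 0, 1, 0], labels)
```

A feature that separates the classes perfectly recovers the full label entropy (1 bit for a balanced label), while an unrelated feature yields zero.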