Pei-Fu Chen, Kuan-Chih Chen, Wei-Chih Liao, Feipei Lai, Tai-Liang He, Sheng-Che Lin, Wei-Jen Chen, Chi-Yu Yang, Yu-Cheng Lin, I-Chang Tsai, Chi-Hao Chiu, Shu-Chih Chang, Fang-Ming Hung.
Abstract
BACKGROUND: The tenth revision of the International Classification of Diseases (ICD-10) is widely used for epidemiological research and health management. The clinical modification (CM) and procedure coding system (PCS) of ICD-10 were developed to describe more clinical detail through an expanded set of diagnosis and procedure codes, and they are applied in disease-related groups for reimbursement. This expansion of codes has made coding time-consuming and less accurate. A state-of-the-art model using deep contextual word embeddings was used for automatic multilabel text classification of ICD-10 codes. In addition to discharge diagnoses (DD) as input, performance can be improved by appropriate preprocessing of text from other document types, such as medical history, comorbidity and complication, surgical method, and special examination.
Keywords: International Classification of Diseases; algorithm; coding system; data mining; deep learning; electronic health record; medical records; multilabel text classification; natural language processing
Year: 2022; PMID: 35767353; PMCID: PMC9282222; DOI: 10.2196/37557
Source DB: PubMed; Journal: JMIR Med Inform
Figure 1. Data counts of 5 types of documents.
Word counts of 5 types of documents.
| Document type | Maximal word count | Mean word count |
| Discharge diagnoses | 480 | 31 |
| Surgical method | 487 | 11 |
| Special examination | 2342 | 86 |
| Medical history | 586 | 149 |
| Comorbidity and complication | 338 | 5 |
Figure 2. Data processing flow chart and the model architecture. BioBERT: bidirectional encoder representations from transformers for biomedical text mining; CLS: classification; CM: clinical modification; ICD: International Classification of Diseases; PCS: procedure coding system; T: token; Woutput: output weight; Wp: pooled weight.
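The legend of Figure 2 names a CLS token, a pooled weight (Wp), and an output weight (Woutput). A minimal sketch of how such a multilabel classification head is commonly wired is shown here. The tanh pooling, the sigmoid output, the 0.5 threshold, the omission of bias terms, and the function name `multilabel_head` are assumptions borrowed from standard BERT-style classifiers, not details confirmed by this record.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_head(h_cls, W_p, W_output, threshold=0.5):
    """Hypothetical head: pool the CLS vector, then score each code independently.

    Assumed (not stated in the source): tanh pooling, sigmoid outputs, no biases.
    """
    pooled = [math.tanh(v) for v in matvec(W_p, h_cls)]      # pooled CLS representation
    probs = [sigmoid(v) for v in matvec(W_output, pooled)]   # one probability per code
    labels = [int(p >= threshold) for p in probs]            # multilabel decision
    return probs, labels


# Toy example: a 3-dimensional CLS vector and 2 candidate codes.
h_cls = [0.5, -1.0, 2.0]
W_p = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_output = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
probs, labels = multilabel_head(h_cls, W_p, W_output)
print(probs, labels)
```

Because each code gets its own sigmoid rather than sharing a softmax, any number of codes can be assigned to one document, which is what multilabel classification requires.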
Figure 3. Data preprocessing framework of ICD-10-CM classification model. CM: clinical modification; CT: computed tomography; ER: emergency room; ICD: International Classification of Diseases; L: lumbar; LAR: low anterior resection; OS: oculus sinister.
Figure 4. Data preprocessing framework of ICD-10-PCS classification model. AR: aortic regurgitation; CAD: coronary arterial disease; CBD: common bile duct; CV: cardiovascular; EGD: esophagogastroduodenoscopy; EGJ: esophago-gastric junction; GB: gall bladder; ICD: International Classification of Diseases; IHD: intrahepatic duct; IV: intravenous; LA: left atrium; LV: left ventricle; LVEF: left ventricular ejection fraction; MR: mitral regurgitation; PCS: procedure coding system; PV: portal vein; R/O: rule out; s/p: status post; TKR: total knee replacement; TR: tricuspid regurgitation.
Comparison of different preprocessing methods for the BioBERT model on ICD-10-CM. Preprocessing methods are added one by one, and 95% CIs are calculated by bootstrapping.
| Preprocessing method | Micro F1 score (95% CI) | Microprecision (95% CI) | Microrecall (95% CI) | AUROC (95% CI) |
| Baseline | 0.749 (0.744-0.753) | 0.836 (0.832-0.840) | 0.678 (0.672-0.684) | 0.839 (0.835-0.842) |
| +Trained with definition | 0.759 (0.754-0.763) | 0.833 (0.829-0.838) | 0.696 (0.690-0.702) | 0.848 (0.845-0.851) |
| +External cause codes removal | 0.763 (0.759-0.767) | 0.843 (0.840-0.846) | 0.697 (0.691-0.702) | 0.849 (0.846-0.851) |
| +Number converting | 0.767 (0.761-0.772) | 0.845 (0.840-0.849) | 0.702 (0.695-0.708) | 0.851 (0.847-0.854) |
| +Combination code filter | 0.769 (0.764-0.773) | 0.845 (0.841-0.850) | 0.706 (0.699-0.711) | 0.853 (0.849-0.855) |
BioBERT: bidirectional encoder representations from transformers for biomedical text mining; ICD: International Classification of Diseases; CM: clinical modification; AUROC: area under the receiver operating characteristic curve.
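The table above reports micro-averaged scores with 95% CIs obtained by bootstrapping. One way such intervals might be computed is sketched here. The record does not say what was resampled or how the interval was read off, so resampling whole documents with replacement, the percentile method, the 1000 resamples, and the helper names `micro_prf` and `bootstrap_ci` are all assumptions, and the toy labels are invented for illustration.

```python
import random


def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over binary label matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for micro F1, resampling documents (assumed scheme)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample documents with replacement
        _, _, f1 = micro_prf([y_true[i] for i in idx], [y_pred[i] for i in idx])
        scores.append(f1)
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data: 4 documents, 3 candidate codes each (illustrative only).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
precision, recall, f1 = micro_prf(y_true, y_pred)
ci_low, ci_high = bootstrap_ci(y_true, y_pred)
print(f"micro F1 = {f1:.3f}, 95% CI = ({ci_low:.3f}-{ci_high:.3f})")
```

Micro averaging pools true positives, false positives, and false negatives across all codes before computing the ratios, so frequent codes weigh more than rare ones.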
Comparison of different preprocessing methods for the BioBERT model on ICD-10-PCS. The 95% CIs are calculated by bootstrapping.
| Preprocessing method | Micro F1 score (95% CI) | Microprecision (95% CI) | Microrecall (95% CI) | AUROC (95% CI) |
| DD | 0.670 (0.663-0.678) | 0.756 (0.750-0.761) | 0.601 (0.593-0.610) | 0.800 (0.796-0.805) |
| SM | 0.618 (0.607-0.627) | 0.750 (0.741-0.762) | 0.524 (0.512-0.534) | 0.762 (0.756-0.767) |
| SM or DD | 0.714 (0.708-0.721) | 0.790 (0.784-0.791) | 0.651 (0.644-0.660) | 0.826 (0.822-0.830) |
| (SM+SE) or DD | 0.724 (0.718-0.730) | 0.801 (0.794-0.808) | 0.661 (0.654-0.668) | 0.830 (0.827-0.834) |
| (SM+SE) or (DD+SE) | 0.726 (0.719-0.732) | 0.803 (0.797-0.810) | 0.661 (0.654-0.669) | 0.831 (0.827-0.834) |
BioBERT: bidirectional encoder representations from transformers for biomedical text mining; ICD: International Classification of Diseases; PCS: procedure coding system; AUROC: area under the receiver operating characteristic curve; DD: discharge diagnoses; SM: surgical method; SE: special examination.
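Since micro F1 is the harmonic mean of micro precision and micro recall, the point estimates in the ICD-10-PCS table can be cross-checked against one another. The short sketch below recomputes F1 from the reported precision and recall. The 0.002 tolerance is an assumption chosen to absorb rounding of the published three-decimal values, and the names `rows`, `f1_from`, and `TOLERANCE` are illustrative.

```python
# Reported point estimates from the ICD-10-PCS table: (micro F1, precision, recall).
rows = {
    "DD": (0.670, 0.756, 0.601),
    "SM": (0.618, 0.750, 0.524),
    "SM or DD": (0.714, 0.790, 0.651),
    "(SM+SE) or DD": (0.724, 0.801, 0.661),
    "(SM+SE) or (DD+SE)": (0.726, 0.803, 0.661),
}

TOLERANCE = 0.002  # assumed allowance for rounding in the published values


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


mismatches = []
for name, (reported_f1, precision, recall) in rows.items():
    recomputed = f1_from(precision, recall)
    if abs(recomputed - reported_f1) > TOLERANCE:
        mismatches.append(name)
    print(f"{name}: reported {reported_f1:.3f}, recomputed {recomputed:.3f}")

print("rows outside tolerance:", mismatches)
```

A recomputed value that differs from the reported one only in the third decimal place is consistent with rounding of the published precision and recall.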