Zhao Shuai1, Diao Xiaolin1, Yuan Jing2, Huo Yanni1, Cui Meng1, Wang Yuxin1, Zhao Wei3.
Abstract
BACKGROUND: Automated ICD coding of medical texts via machine learning has been a hot topic. Related studies from the medical field rely heavily on conventional bag-of-words (BoW) as the feature extraction method and do not commonly use more complicated methods, such as word2vec (W2V) and large pretrained models like BERT. This study aimed to uncover the most effective feature extraction methods for coding models by comparing BoW, W2V and BERT variants.
Keywords: Automated ICD coding; BERT; Bag-of-words; Feature extraction; Interpretability; Word2vec
Year: 2022 PMID: 35022039 PMCID: PMC8756659 DOI: 10.1186/s12911-022-01753-5
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Word2vec. Left: CBOW. Right: skip-gram
Fig. 2 The unit layer of the encoder of Transformer
Descriptive statistics of the datasets
| | Fuwai: Word | Fuwai: Character | Fuwai: Code | CodiEsp: Word | CodiEsp: Code |
|---|---|---|---|---|---|
| Token size | 691,418 | 1,557,769 | 44,366 | 161,078 | 11,158 |
| Vocabulary size | 9,130 | 1,768 | 1,532 | 14,885 | 2,557 |
| Average length | 99.5 | 224.2 | 6.4 | 161.1 | 11.2 |
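The three statistics in the table can be computed from any tokenized corpus. The sketch below is a minimal illustration, assuming that "Average length" means tokens per record; the table's figures are consistent with that reading (for example, 161,078 / 161.1 is roughly 1,000), but the record does not state it. The function name `describe` and the toy corpus are hypothetical.

```python
from collections import Counter


def describe(docs):
    """Return token size, vocabulary size, and average length for tokenized docs.

    Assumption: average length = total tokens / number of documents.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    token_size = sum(counts.values())   # total number of tokens
    vocab_size = len(counts)            # number of distinct tokens
    avg_len = token_size / len(docs) if docs else 0.0
    return {
        "token_size": token_size,
        "vocab_size": vocab_size,
        "average_length": round(avg_len, 1),
    }


# Hypothetical toy corpus, not taken from either dataset.
docs = [
    ["chest", "pain", "on", "exertion"],
    ["pain", "relieved", "by", "rest"],
    ["no", "chest", "pain"],
]
stats = describe(docs)
print(stats)
```

The same function would apply to word, character, or code sequences, since it only needs lists of tokens.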
Fig. 3 The distribution of code frequencies in the datasets
Fig. 4 The most frequent 10 codes in the datasets
Fig. 5 The framework of our methodology
Parameters for training W2V embeddings
| Parameters | Descriptions | For Fuwai dataset | For CodiEsp dataset |
|---|---|---|---|
| | Whether Skip-gram is used. | 1 | 1 |
| | The dimension of resulting embeddings. | 256 | 128 |
| | The length of a text window. | 5 | 5 |
| | Training on data for how many iterations. | 5 | 5 |
| | Discarding words/characters appearing less than how many times. | 3 | 3 |
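To make two of the settings above concrete, the following sketch shows how a window of 5 and a minimum frequency of 3 could shape the (center, context) pairs that a skip-gram model trains on. It is an illustration only, not the authors' training code: the function name `skipgram_pairs` is hypothetical, the window is fixed rather than randomly shrunk, and no frequent-word subsampling or negative sampling is applied.

```python
from collections import Counter


def skipgram_pairs(docs, window=5, min_count=3):
    """Build (center, context) training pairs for a skip-gram model.

    Tokens seen fewer than `min_count` times in the corpus are dropped
    before windows are formed (an assumption of this sketch).
    """
    freq = Counter(tok for doc in docs for tok in doc)
    pairs = []
    for doc in docs:
        kept = [t for t in doc if freq[t] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)
            hi = min(len(kept), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a token is not its own context
                    pairs.append((center, kept[j]))
    return pairs


# Hypothetical toy corpus: "c" and "x" occur fewer than 3 times and are discarded.
docs = [["a", "b", "c"], ["a", "b", "x"], ["a", "b", "c"]]
pairs = skipgram_pairs(docs, window=1, min_count=3)
print(pairs)
```

In CBOW the direction is reversed: the context tokens jointly predict the center token, while the windowing logic stays the same.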
Coding results for Fuwai dataset with
| Feature extraction & classifiers | ||||
|---|---|---|---|---|
| 84.44 | 91.54 | 88.58 | 93.75 | |
| 84.69 | 91.78 | 89.27 | 94.10 | |
| 84.83 | 89.08 | 94.41 | ||
| 83.02 | 91.57 | 88.23 | 93.93 | |
| 83.01 | 91.50 | 88.00 | 93.88 | |
| 78.21 | 89.45 | 85.20 | 92.19 | |
| 53.14 | 75.07 | 71.60 | 82.05 | |
| 35.73 | 64.92 | 64.09 | 75.10 | |
| 48.03 | 70.54 | 68.77 | 79.04 | |
| 26.30 | 58.86 | 60.37 | 71.64 | |
| 61.73 | 75.75 | 85.47 | ||
| 46.26 | 73.68 | 69.17 | 80.51 | |
| 64.56 | 78.59 | 77.90 | 85.51 | |
| 51.30 | 75.24 | 71.86 | 82.45 | |
| 72.41 | 82.23 | 89.07 | ||
| 64.25 | 81.41 | 77.57 | 86.44 | |
| 4.31 | 40.59 | 69.56 | 80.32 | |
| 83.39 | 98.65 | 99.55 | ||
For BoW, _uni, _uni_bi and _uni_bi_tri mean unigram, unigram+bigram and unigram+bigram+trigram respectively. For W2V, _comb means concatenating character and word embeddings, while _char (_word) means character (word) embeddings only. For RoBERTa_embeddings, _char means the RoBERTa-Mini embeddings only, and _comb means concatenating the RoBERTa-Mini embeddings and W2V word embeddings. For RoBERTa_finetune, whole and top_layer mean fine-tuning the whole network and only the top fully connected layer respectively.
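The three BoW variants differ only in the highest n-gram order included in the feature set. A minimal sketch of that idea follows, assuming raw n-gram counts as feature values; the record does not say whether counts, binary indicators, or TF-IDF weights were used, and the function name `ngram_counts` and the example tokens are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of order 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


tokens = ["chest", "pain", "chest", "tightness"]
uni = ngram_counts(tokens, 1)          # corresponds to the _uni suffix
uni_bi = ngram_counts(tokens, 2)       # corresponds to _uni_bi
uni_bi_tri = ngram_counts(tokens, 3)   # corresponds to _uni_bi_tri
print(len(uni), len(uni_bi), len(uni_bi_tri))
```

Each added order enlarges the feature space, which is the trade-off behind comparing the three variants.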
Coding results for CodiEsp dataset with
| Feature extraction & classifiers | ||||
|---|---|---|---|---|
| 63.55 | 63.68 | 70.44 | 70.04 | |
| 70.68 | 70.34 | 75.13 | 74.45 | |
| 63.93 | 63.85 | 72.36 | 71.08 | |
| 72.46 | 77.22 | 75.95 | ||
| 62.41 | 62.26 | 71.39 | 69.97 | |
| 69.48 | 69.26 | 75.39 | 73.90 | |
| 56.07 | 56.07 | 64.39 | 64.62 | |
| 59.52 | 66.86 | 67.11 | ||
| 64.00 | 69.33 | 68.61 | ||
| 59.15 | 59.02 | 64.29 | 64.11 | |
| 61.26 | 60.91 | 66.32 | 65.85 | |
| 62.52 | 62.45 | 67.68 | 67.73 | |
| 17.21 | 22.19 | 48.79 | 49.40 | |
| 85.32 | 91.44 | 92.82 | ||
Aside from BERT_embeddings, the suffixes have the same meanings as those in Table 3. For BERT_embeddings, _word means the BERT-mini embeddings only, and _comb means concatenating the BERT-mini embeddings and W2V word embeddings.
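The _comb variants join two embedding sources into one vector per token. The sketch below shows one plausible reading, assuming both sources yield one vector per token and that the sequences are already aligned; how the authors align BERT-mini tokens with W2V word tokens is not stated in this record. The function name `concat_embeddings` and the toy vectors are hypothetical.

```python
def concat_embeddings(contextual, static):
    """Concatenate two aligned per-token vector sequences.

    Assumption: both inputs contain one vector per token, in the same order.
    """
    if len(contextual) != len(static):
        raise ValueError("sequences must be aligned token by token")
    return [c + s for c, s in zip(contextual, static)]


# Toy vectors: 2 tokens, with dimension 3 and dimension 2 respectively.
bert_like = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
w2v_like = [[1.0, 2.0], [3.0, 4.0]]
combined = concat_embeddings(bert_like, w2v_like)
print(len(combined), len(combined[0]))
```

The combined dimension is the sum of the two input dimensions, so the downstream classifier's input size must be adjusted accordingly.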
Coding results for Fuwai dataset with
| Feature extraction & classifiers | ||||
|---|---|---|---|---|
| 52.95 | 71.82 | 86.88 | ||
| 47.41 | 82.70 | 70.46 | 86.73 | |
| 46.25 | 80.79 | 69.10 | 85.48 | |
| 37.70 | 79.07 | 66.11 | 84.13 | |
| 39.12 | 72.85 | 65.93 | 80.49 | |
| 27.24 | 67.62 | 61.68 | 76.77 | |
| 22.81 | 63.29 | 58.79 | 74.85 | |
| 12.74 | 53.46 | 55.07 | 68.99 | |
| 19.16 | 58.43 | 57.17 | 72.09 | |
| 8.16 | 45.92 | 53.19 | 65.40 | |
| 29.08 | 61.32 | 78.13 | ||
| 16.92 | 61.84 | 56.97 | 73.39 | |
| 34.75 | 69.03 | 63.89 | 79.25 | |
| 23.41 | 64.75 | 59.58 | 75.74 | |
| 39.44 | 74.32 | 66.00 | 82.17 | |
| 29.64 | 62.16 | 79.01 | ||
| 0.67 | 31.06 | 62.83 | 84.21 | |
| 2.43 | 75.00 | 90.26 | ||
Coding results for CodiEsp dataset with
| Feature extraction & classifiers | ||||
|---|---|---|---|---|
| 5.96 | 13.81 | 51.81 | 53.72 | |
| 24.06 | 58.61 | 62.61 | ||
| 2.42 | 6.29 | 50.70 | 51.62 | |
| 12.79 | 23.56 | 54.39 | 56.75 | |
| 1.55 | 4.04 | 50.43 | 51.03 | |
| 8.14 | 14.76 | 52.70 | 54.00 | |
| 0.57 | 50.15 | 50.57 | ||
| 0.00 | 0.00 | 50.00 | 50.00 | |
| 15.81 | 22.77 | 55.42 | 57.28 | |
| 15.34 | 21.39 | 56.00 | 57.55 | |
| 17.71 | 56.25 | 58.78 | ||
| 18.24 | 25.75 | 57.43 | 59.68 | |
| 0.01 | 0.04 | 52.70 | 65.02 | |
| 1.72 | 68.40 | 74.87 | ||
Fig. 6 The metrics for the feature extraction methods on Fuwai dataset
Fig. 7 The metrics for the feature extraction methods on CodiEsp dataset
for interpreting the code assignment by RoBERTa-Mini
| 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|
| 73.2% | 74.2% | 74.7% | 75.2% | 75.0% |
for interpreting BoW
| 10 | 20 | 30 | 40 | 50 | |
|---|---|---|---|---|---|
| 80.0% | 80.0% | 83.0% | 80.0% | 76.0% |
10 key words and the target ICD codes
| Key words | Target codes |
|---|---|
| fever, disease, pain, antibiotic, drainage, crp, painful, leukocytosis, vas, pleural | r52 (pain, not elsewhere classified), r69 (illness unspecified), r50.9 (Fever, unspecified) |
A clinical record for interpreting the code assignment by BERT-mini
| Case description | ICD code |
|---|---|
| year male patient evaluated pain grade iii obliterating arteriopathy involvement limbs received analgesic treatment | r52 (pain not elsewhere classified) |