Xiaozheng Li, Huazhen Wang, Huixin He, Jixiang Du, Jian Chen, Jinzhun Wu.
Abstract
BACKGROUND: Driven by big data, powerful computation and new algorithmic techniques, deep learning has seen a renaissance, particularly in the combination of natural language processing (NLP) and deep neural networks. The advent of electronic medical records (EMRs) has not only changed the format of medical records but also helped users obtain information faster. However, working directly with Chinese EMRs poses many challenges: the records are of low quality, huge in quantity, imbalanced, and semi-structured or unstructured, and the Chinese language is denser than English. Effective word segmentation, word representation and model architecture are therefore the core technologies in the literature on Chinese EMRs.
Keywords: Chinese electronic medical records; Convolutional neural networks; Natural language processing
Year: 2019 PMID: 30709336 PMCID: PMC6359854 DOI: 10.1186/s12859-019-2617-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Schema of our proposed framework. NLP technology involves a series of operations, including word segmentation, word embedding and model training
Distribution of datasets with respect to four types of classification applications for pediatric Chinese EMRs
| Number of diseases | Name of diseases | Number of samples |
|---|---|---|
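| 7 | Allergic rhinitis; bronchitis; acute bronchitis; respiratory disease; bronchial asthma, no critical; diarrhea; cough variant asthma | 49,148 |
| 8 | Allergic rhinitis; bronchitis; acute bronchitis; respiratory disease; bronchial asthma, no critical; diarrhea; cough variant asthma; **acute upper respiratory tract infection** | 92,744 |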
| 32 | See Additional file | 132,637 |
| 63 | See Additional file | 144,170 |
Boldface represents an additional disease compared with the seven-classification application
Fig. 2 Semantic rationality with and without our medical dictionary
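
To illustrate why a medical dictionary can change segmentation, here is a minimal sketch using forward maximum matching. This technique is an assumption chosen for illustration, since this excerpt does not state which segmenter the authors used. The vocabularies and the function name `forward_max_match` are hypothetical.

```python
def forward_max_match(text, vocab, max_len=8):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        # A single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: a general one, and one extended with a medical term.
general_vocab = {"咳嗽", "变异性", "哮喘"}
medical_vocab = general_vocab | {"咳嗽变异性哮喘"}

term = "咳嗽变异性哮喘"  # "cough variant asthma"
without_dictionary = forward_max_match(term, general_vocab)
with_dictionary = forward_max_match(term, medical_vocab)
print(without_dictionary)
print(with_dictionary)
```

With the general vocabulary the disease name is split into three generic words; with the extended vocabulary it stays a single token, which is the kind of difference the figure compares.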
One-layer CNN accuracy for different dimensions with respect to four types of classification applications
| Text classification | 50 (%) | 80 (%) | 100 (%) |
|---|---|---|---|
| 7 classes | | 83.65 | 83.63 |
| 8 classes | 82.26 | | 82.51 |
| 32 classes | 73.13 | 73.44 | |
| 63 classes | 70.39 | 71.06 | |
Boldface represents the best
Semantic similarity of word vectors
| Word | Cosine distance |
|---|---|
| Recurrent cough | 0.6350 |
| Quiet cough | 0.6196 |
| Bad cough | 0.5433 |
| Little cough | 0.5204 |
| Dry cough | 0.5208 |
| Nasal obstruction | 0.5914 |
| Phlegm | 0.5434 |
| Vomiting | 29.48 |
| Afternoon | 23.41 |
| Muscular stiffness | 22.83 |
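
As a reminder of how a cosine score between two word vectors is computed, here is a minimal sketch. The three-dimensional vectors are toy values invented for illustration and are not taken from the paper's embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical), standing in for learned word embeddings.
cough = [0.9, 0.1, 0.3]
dry_cough = [0.8, 0.2, 0.4]
afternoon = [-0.1, 0.9, 0.0]

related = cosine_similarity(cough, dry_cough)
unrelated = cosine_similarity(cough, afternoon)
print(f"cough vs dry cough: {related:.4f}")
print(f"cough vs afternoon: {unrelated:.4f}")
```

A score near 1 means the vectors point in nearly the same direction, so semantically close terms are expected to score higher than unrelated ones.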
Fig. 3 Structure of a CNN. Unlike a traditional feed-forward neural network, a CNN is a multi-layer neural network with four parts: an embedding layer, a convolution layer, a pooling layer and a fully connected layer
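
The four parts named in the caption can be sketched as a single forward pass. This is a toy illustration with random weights, not the authors' model: the layer sizes, the ReLU activation, max-over-time pooling and the softmax output are all assumptions.

```python
import math
import random

random.seed(0)

# Toy sizes (hypothetical): vocabulary, embedding width, filter window,
# number of filters, number of output classes.
VOCAB, EMB_DIM, WINDOW, N_FILTERS, N_CLASSES = 20, 4, 3, 5, 7


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


embedding = rand_matrix(VOCAB, EMB_DIM)             # embedding layer
filters = rand_matrix(N_FILTERS, WINDOW * EMB_DIM)  # convolution layer
dense = rand_matrix(N_CLASSES, N_FILTERS)           # fully connected layer


def forward(token_ids):
    """Return class probabilities for a sequence of at least WINDOW token ids."""
    # 1. Embedding: map each token id to its vector.
    vectors = [embedding[t] for t in token_ids]
    # 2. Convolution: apply each filter to every window of WINDOW tokens,
    #    followed by a ReLU.
    windows = [sum(vectors[i:i + WINDOW], [])
               for i in range(len(vectors) - WINDOW + 1)]
    feature_maps = [
        [max(0.0, sum(w * x for w, x in zip(f, win))) for win in windows]
        for f in filters
    ]
    # 3. Pooling: keep the strongest response of each filter.
    pooled = [max(fm) for fm in feature_maps]
    # 4. Fully connected layer, then a numerically stable softmax.
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in dense]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = forward([3, 7, 1, 12, 5, 0])
print(len(probs), round(sum(probs), 6))
```

Stacking more convolution layers between steps 2 and 3 would give the deeper variants compared in the results tables.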
Comparative results of the CNN model with the seven-classification application
| Fold | One-layer CNN precision (%) | One-layer CNN accuracy (%) | One-layer CNN F1-score (%) | Two-layer CNN precision (%) | Two-layer CNN accuracy (%) | Two-layer CNN F1-score (%) | Three-layer CNN precision (%) | Three-layer CNN accuracy (%) | Three-layer CNN F1-score (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 84.26 | 84.1 | 84.16 | 83.13 | 82.9 | 82.97 | 83.05 | 82.74 | 82.84 |
| 2 | 83.63 | 83.5 | 83.52 | 82.65 | 82.42 | 82.5 | 82.32 | 81.53 | 81.66 |
| 3 | 83.86 | 83.55 | 83.61 | 82.54 | 82.26 | 82.35 | 79.09 | 78.89 | 78.94 |
| 4 | 84.07 | 83.75 | 83.84 | 82.78 | 82.51 | 82.58 | 82.28 | 82.02 | 82.05 |
| 5 | 83.87 | 83.71 | 83.76 | 82.97 | 82.81 | 82.85 | 82.6 | 82.37 | 82.4 |
| Average | 83.94 | 83.72 | 83.78 | 82.81 | 82.58 | 82.65 | 81.87 | 81.51 | 81.58 |
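
The "Average" row is the arithmetic mean over the five folds. The short check below reproduces two of its entries from the per-fold accuracies listed in the table; the helper name `fold_average` is only illustrative.

```python
# Per-fold accuracies (%) copied from the table above.
one_layer_accuracy = [84.1, 83.5, 83.55, 83.75, 83.71]
three_layer_accuracy = [82.74, 81.53, 78.89, 82.02, 82.37]


def fold_average(values):
    """Mean over folds, rounded to two decimals as in the table."""
    return round(sum(values) / len(values), 2)


one_layer_mean = fold_average(one_layer_accuracy)
three_layer_mean = fold_average(three_layer_accuracy)
print(one_layer_mean, three_layer_mean)
```

Both values match the reported averages of 83.72 and 81.51.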
Fig. 4 Confusion matrices of the three CNN models. a Normalized confusion matrix of the one-layer CNN. b Unnormalized confusion matrix of the one-layer CNN. c Normalized confusion matrix of the two-layer CNN. d Normalized confusion matrix of the three-layer CNN
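
The difference between the unnormalized and normalized panels can be sketched as follows, assuming the common convention that each row (true class) is divided by its total. The counts are toy numbers, not values from the figure.

```python
def normalize_rows(matrix):
    """Divide each row of a confusion matrix by its row total."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Toy counts (hypothetical): rows are true classes, columns are predictions.
counts = [
    [80, 15, 5],
    [10, 70, 20],
    [0, 5, 45],
]

rates = normalize_rows(counts)
for row in rates:
    print([round(value, 2) for value in row])
```

After row normalization the diagonal gives the per-class recall, which makes small and large classes comparable in a way raw counts do not.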
Results of our CNN models against other methods
| Model | Precision(%) | Accuracy(%) | F1-score(%) |
|---|---|---|---|
| 1-layer CNN | **83.94** | **83.72** | **83.78** |
| 1-layer LSTM | 43.97 | 46.33 | 38.18 |
| 1-layer GRU | 82.95 | 82.2 | 82.37 |
| 2-layers CNN | 82.81 | 82.58 | 82.65 |
| 2-layers LSTM | 23.01 | 34.12 | 19.57 |
| 2-layers GRU | 83.03 | 82.4 | 82.57 |
| 3-layers CNN | 81.87 | 81.51 | 81.58 |
| CNN-1LSTM | 83.86 | 83.55 | 83.62 |
| CNN-2LSTM | 83.63 | 83.18 | 83.33 |
| CNN-1GRU | 83.42 | 83.02 | 83.13 |
| CNN-2GRU | 83.52 | 82.95 | 83.1 |
Boldface represents the best
Accuracies of fine-tuning the one-layer CNN model with respect to four types of classification applications
| The number of diseases | precision(%) | accuracy(%) | F1-score(%) |
|---|---|---|---|
| 7 classes | | | |
| 8 classes | 82.35 | 82.55 | 82.27 |
| 32 classes | 73.09 | 73.54 | 72.5 |
| 63 classes | 70.59 | 71.2 | 69.61 |
Boldface represents the best
Fig. 5 Description of a typical pediatric Chinese EMR datum
Fig. 6 Impact of three types of parameter on the accuracy of the CNN model. Note: “pre” refers to head-filling or head-truncation and “post” refers to tail-filling or tail-truncation. For example, “pre_post” means that short text is padded at the head and long text is truncated at the tail
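
The "pre" and "post" options in the caption can be sketched as one helper that brings every token sequence to a fixed length. The function name `pad_or_truncate`, the padding id `0` and the target length are assumptions for illustration.

```python
def pad_or_truncate(ids, max_len, padding="pre", truncating="post", pad_id=0):
    """Bring a list of token ids to exactly max_len entries.

    padding:    "pre" fills at the head, "post" fills at the tail.
    truncating: "pre" drops tokens from the head, "post" drops from the tail.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(ids) > max_len:
        return ids[-max_len:] if truncating == "pre" else ids[:max_len]
    fill = [pad_id] * (max_len - len(ids))
    return fill + ids if padding == "pre" else ids + fill


# "pre_post": short text is filled at the head, long text is cut at the tail.
short_text = pad_or_truncate([5, 6, 7], 5, padding="pre", truncating="post")
long_text = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5, padding="pre", truncating="post")
print(short_text)
print(long_text)
```

Which end is cut matters whenever the informative part of a record sits near its start or its end.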
Comparative accuracies for the seven-classification and eight-classification applications with and without class weights
| Class | Name of class | Sample size | Seven-classification, without class weight | Seven-classification, with class weight | Eight-classification, without class weight | Eight-classification, with class weight |
|---|---|---|---|---|---|---|
| Class1 | Allergic rhinitis | 1079 | 71.09 | | 59.68 | |
| Class2 | Respiratory disease | 11980 | | 87.92 | 85.28 | |
| Class3 | Cough variant asthma | 1418 | 70.31 | | 67.12 | |
| Class4 | Acute bronchitis | 11990 | 77.5 | | 65.56 | |
| Class5 | Bronchial asthma, no critical | 1550 | 79.23 | | 78.82 | |
| Class6 | Bronchitis | 17726 | | 73.42 | | 51.42 |
| Class7 | Diarrhea | 3405 | 97.91 | | 94.9 | |
| Class8 | Acute upper respiratory tract infection | 43596 | NA | NA | | 84.11 |
Boldface represents the best
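
One common way to derive class weights from imbalanced sample sizes is inverse-frequency weighting, sketched below for the seven-class sample sizes in the table. This excerpt does not state the weighting formula the authors used, so the "balanced" heuristic here is an assumption.

```python
# Sample sizes of the seven classes, copied from the table above.
sample_sizes = {
    "Allergic rhinitis": 1079,
    "Respiratory disease": 11980,
    "Cough variant asthma": 1418,
    "Acute bronchitis": 11990,
    "Bronchial asthma, no critical": 1550,
    "Bronchitis": 17726,
    "Diarrhea": 3405,
}


def balanced_class_weights(counts):
    """Assumed heuristic: weight = total / (number of classes * class count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


weights = balanced_class_weights(sample_sizes)
for name, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name}: {weight:.2f}")
```

Under this heuristic the smallest class receives the largest weight and the largest class the smallest, so errors on rare diseases cost more during training.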
Comparative results for the seven-classification and eight-classification applications with and without class weights
| Metrics | Seven-classification, without class weight | Seven-classification, with class weight | Eight-classification, without class weight | Eight-classification, with class weight |
|---|---|---|---|---|
| Precision (%) | | 82.27 | | 80.97 |
| Accuracy (%) | | 80.99 | | 78.15 |
| F1-score (%) | | 81.25 | | 78.45 |
Boldface represents the best