Yuekun Ma, Yun Liu, Dezheng Zhang, Jiye Zhang, He Liu, Yonghong Xie.
Abstract
Recognition of Traditional Chinese Medicine (TCM) entities in different types of literature is a challenging research problem, and it is the foundation for extracting the large amount of TCM knowledge held in unstructured texts into structured formats. The lack of large-scale annotated data makes the application of conventional deep learning models to TCM text knowledge extraction unsatisfactory, while other, unsupervised methods rely on auxiliary data such as domain dictionaries. We propose a multigranularity text-driven NER model based on a Conditional Generative Adversarial Network (MT-CGAN) to implement TCM NER with a small-scale annotated corpus. In the model, a multigranularity text features encoder (MTFE) is designed to extract rich semantic and grammatical information from multiple dimensions of TCM texts. By differentiating the conditional constraints of the generator and the discriminator of MT-CGAN, the synchronization between the generated tag labels and the named entities is guaranteed. Furthermore, seeds for different TCM text types are introduced into the model to improve the precision of NER. We compare our method with baseline methods on 4 kinds of gold-standard datasets to illustrate its effectiveness. The experimental results show that the precision, recall, and F1 score of our method exceed those of state-of-the-art methods by 0.24∼8.97%, 0.89∼12.74%, and 0.01∼10.84%, respectively. MT-CGAN is able to extract entities from different types of TCM literature effectively. Our experimental results indicate that the proposed approach has a clear advantage in processing TCM texts with more entity types, higher sparsity, less regular features, and a small-scale corpus.
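As a structural illustration only (not the authors' implementation), the differentiated conditioning described in the abstract — the generator and the discriminator receiving different conditional inputs — can be sketched as follows. All dimensions, the linear layers, and the function names are assumptions; the generator is conditioned on fused text features (the paper's Y) while the discriminator is conditioned on a label sequence (the paper's L):

```python
import numpy as np

rng = np.random.default_rng(0)

SEQ_LEN, COND_DIM, N_TAGS = 8, 16, 5  # assumed toy sizes

# Hypothetical generator: maps its own condition (fused multigranularity
# text features, "Y") to a per-token tag distribution via an assumed
# linear layer followed by a softmax.
W_g = rng.normal(size=(COND_DIM, N_TAGS))

def generator(cond_y):
    logits = cond_y @ W_g                          # (SEQ_LEN, N_TAGS)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)       # rows sum to 1

# Hypothetical discriminator: scores a tag sequence under a *different*
# condition (a gold label sequence, "L"), returning a real/fake probability.
W_d = rng.normal(size=(2 * N_TAGS,))

def discriminator(tag_seq, cond_l):
    feats = np.concatenate([tag_seq, cond_l], axis=-1)   # (SEQ_LEN, 2*N_TAGS)
    score = feats @ W_d
    return 1.0 / (1.0 + np.exp(-score.mean()))           # scalar in (0, 1)

cond_y = rng.normal(size=(SEQ_LEN, COND_DIM))            # fused features Y
cond_l = np.eye(N_TAGS)[rng.integers(0, N_TAGS, SEQ_LEN)]  # one-hot labels L

fake_tags = generator(cond_y)        # G conditioned on Y
p_real = discriminator(fake_tags, cond_l)  # D conditioned on L
```

The point of the sketch is only the wiring: G and D each see a distinct conditional input, which is what the abstract credits for keeping generated tags synchronized with the entities.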
Year: 2022 PMID: 36248956 PMCID: PMC9553443 DOI: 10.1155/2022/1495841
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. Model framework. || denotes a splicing (concatenation) operation and + a contraposition (elementwise) addition operation. C_gra, S_gra, P_gra, and A_gra denote character-, sentence-, paragraph-, and chapter-level features of the text, respectively. FCL represents a fully connected layer, CNNL a convolutional neural network layer, and CEL the cross-entropy loss function.
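The two fusion operations named in the Figure 1 caption can be shown with a small numeric sketch (the feature dimensions are assumptions): splicing (||) concatenates feature vectors along the feature axis, while contraposition addition (+) sums equally shaped vectors element by element:

```python
import numpy as np

# Assumed per-token feature vectors at the four granularities:
# character, sentence, paragraph, chapter (C_gra, S_gra, P_gra, A_gra).
c_gra = np.array([1.0, 2.0])
s_gra = np.array([3.0, 4.0])
p_gra = np.array([5.0, 6.0])
a_gra = np.array([7.0, 8.0])

# "||" splicing: concatenation along the feature dimension.
spliced = np.concatenate([c_gra, s_gra, p_gra, a_gra])   # shape (8,)

# "+" contraposition addition: elementwise sum of same-shape vectors.
added = c_gra + s_gra + p_gra + a_gra                    # shape (2,)

print(spliced.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(added.tolist())    # [16.0, 20.0]
```

Splicing preserves each granularity's features in separate dimensions; addition keeps the dimensionality fixed but merges the granularities into one vector.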
Figure 2. Network structure of each component of the MTFE. (a) CGFE; (b) SGFE; (c) AGFE.
Typical types of entities contained in different types of TCM texts.
| Type of literature | Typical entity types and corresponding labels |
|---|---|
| Canon of medicine | Cognitive method (RSFF), traditional Chinese physiology (ZYSL), traditional Chinese pathology (ZYBL), the principle of treatment (ZZ), method of treatment (ZF) |
| Medical cases | Chinese materia medica (ZY), symptoms (ZZ), pulse (MX), tongue (SX), formula (FJ), dose (FJ) |
| Herbal | Drug property (YX), flavor of medicinals (YW), channel tropism (GJ), virtue (GX), dilantin (YM), symptoms (ZZ) |
| Comprehensive | Pathogeny (BY), disease (JB), syndrome (ZH), pulse (MX), tongue (SX), formula (FJ), Chinese materia medica (ZY) |
Figure 3. A schematic representation of entity distribution features.
Figure 4. The structure diagram of the C-U-NET. FCL represents a fully connected layer.
The scale of the annotated corpus for each experimental dataset.
| Name | Type | Text size (10,000 words) | Size of annotated corpus (10,000 words) |
|---|---|---|---|
| Shennong's classic of materia medica | Herbal | 11.9 | 2.6 |
| Medical cases of famous doctors in different periods of China | Medical cases | 25 | 10 |
| Miraculous pivot | Canon of medicine | 7.9 | 7.9 |
| Syndrome in TCM | General | 564.4 | 28.6 |
Experimental results under different models on the Shennong's classic of materia medica (left three metric columns) and Medical cases of famous doctors in different periods of China (right three metric columns) datasets.
| Model | P (%) | R (%) | F1 (%) | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|---|---|
| BiLSTM-CRF | 86.32 | 87.67 | 86.99 | 87.88 | 89.43 | 88.65 |
| BERT-BiLSTM-CRF | 89.56 | 90.05 | 89.80 | 89.74 | 92.07 | 90.89 |
| Roberta-c | 89.78 | 90.15 | 89.96 | 90.12 | 90.59 | 90.35 |
| MT-CGAN | 90.26 | 89.78 | 90.02 | 90.36 | 91.28 | 90.81 |
Experimental results under different models on the Miraculous pivot (left three metric columns) and syndrome in TCM (right three metric columns) datasets.
| Model | P (%) | R (%) | F1 (%) | P (%) | R (%) | F1 (%) |
|---|---|---|---|---|---|---|
| BiLSTM-CRF | 68.76 | 66.95 | 67.85 | 72.38 | 70.95 | 71.65 |
| BERT-BiLSTM-CRF | 76.65 | 80.72 | 78.64 | 72.33 | 74.30 | 73.30 |
| Roberta-c | 76.98 | 80.47 | 78.68 | 74.62 | 75.39 | 75.00 |
| MT-CGAN | 77.73 | 79.69 | 78.69 | 78.45 | 76.28 | 77.35 |
Experimental results of different combination strategies.
| Combination strategies | F1 (%) |
|---|---|
| Sentence & paragraph | 69.49 |
| Sentence & paragraph & chapter | 71.25 |
| Character & sentence | 74.86 |
| Character & paragraph | 73.17 |
| Character & sentence & paragraph | 76.48 |
| Character & sentence & paragraph & chapter | 77.35 |
Comparison of results with and without the seed of entity distribution features.
| Label seed feature | F1 (%) |
|---|---|
| Random | 75.16 |
| Reflecting the entity distribution of the given type of TCM text | 77.35 |
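One plausible reading of the distribution-aware seed in the table above (a sketch under assumptions; the record does not specify the exact construction) is a vector of entity-type frequencies for a given text type, normalized into a probability distribution, contrasted with a purely random seed. The counts and label codes below are hypothetical, borrowing the codes from the entity-type table:

```python
import random

# Hypothetical entity-type counts for one text type (e.g. "Herbal"),
# keyed by the label codes used in this record's tables.
counts = {"YX": 120, "YW": 90, "GJ": 60, "ZZ": 30}

# Distribution-reflecting seed: normalized frequencies (sums to 1.0).
total = sum(counts.values())
seed_distribution = [n / total for n in counts.values()]

# Baseline for comparison: a random seed of the same length.
random.seed(0)
seed_random = [random.random() for _ in counts]

print(seed_distribution)  # [0.4, 0.3, 0.2, 0.1]
```

Under this reading, the seed biases the generator toward the entity-type mix actually observed in that type of TCM text, which is consistent with the F1 gap reported in the table.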
Experimental results of different main inputs.
| Main input | F1 (%) |
|---|---|
| White noise | 71.62 |
| Attention data | 77.35 |
Experimental results of different condition inputs.
| G | D | F1 (%) | Epoch |
|---|---|---|---|
| | | 72.79 | 20–25 |
| | | 77.35 | 15–20 |
L: the sequence of named-entity labels for the corresponding text sequence. Y: the fusion of the multigranularity text features and the seed of entity distribution features for the corresponding type of TCM text.
Figure 5. Comparison of experimental results at different corpus scales. (a) On the Miraculous pivot dataset; (b) on the Medical cases of famous doctors in different periods of China dataset.