Jun Liang1, Xuemei Xian2, Xiaojun He1, Meifang Xu3, Sheng Dai4, Jun'yi Xin5, Jie Xu1, Jian Yu1, Jianbo Lei6,7.
Abstract
Medical entity recognition, a basic task in the language processing of clinical data, has been extensively studied for admission notes in alphabetic languages such as English. However, much less work has been done on unstructured texts written in Chinese, or on differentiating Chinese drug names between traditional Chinese medicine and Western medicine. Here, we propose a novel cascade-type Chinese medication entity recognition approach that integrates a support vector machine-based sentence category classifier with conditional random field-based medication entity recognition. We hypothesized that this approach could avoid the side effects of abundant negative samples and improve the performance of named entity recognition from admission notes written in Chinese. We therefore applied this approach to a test set of 324 Chinese-written admission notes manually annotated by medical experts. The approach achieved 94.2% precision, 92.8% recall, and a 93.5% F-measure for the recognition of traditional Chinese medicine drug names, and 91.2% precision, 92.6% recall, and a 91.7% F-measure for the recognition of Western medicine drug names. The differences in F-measure were significant compared with those of the baseline systems.
Year: 2017 PMID: 29065612 PMCID: PMC5516712 DOI: 10.1155/2017/4898963
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
Drug name entities and definitions of medication events.
| Semantic type | Definition |
|---|---|
| Drug name entity | Names of TCM or WM drugs used in clinical treatment, including single drugs or drug combinations, general names of drugs, and TCM-specific decoctions and paste formula |
| Name of WM drug | Drugs used in WM, including chemical name, trade name, common name, and prescription name |
| Dose of WM | Dosage of WM for each patient each time as per doctor's advice |
| WM use method | Methods to use WM by a patient as per the doctor's advice |
| WM use frequency | Time interval for use of a single dose of WM as per the doctor's advice |
| Name of TCM drug | Different types of TCM products made from TCM materials or with TCM materials as raw materials, in TCM or TCM-WM combined treatment |
| Dose of TCM | A TCM-exclusive concept, including 3 cases: potions, tablets, and pills. Weight of TCM potions and number of TCM tablets or pills following the doctor's advice |
| TCM drug form | A TCM-exclusive concept or a TCM application mode in adaptation to requirements of treatment or prevention following the doctor's advice |
| TCM overall dose | A TCM-exclusive concept or the total number of medications following the doctor's advice |
| TCM use frequency | Time interval for use of a single dose of TCM following the doctor's advice |
| TCM use method | Methods to use TCM by a patient following the doctor's advice |
| TCM use requirements | A TCM-exclusive concept or the conditions met by a patient to use TCM following the doctor's advice |
Name annotation rules of core drugs.
| Number | Rule description |
|---|---|
| 1 | Drug information should be recorded in the ANs, including the names of disease-treating or symptom-relieving drugs (e.g., TCM, WM, and biological agents). Drug name defined here describes one drug, one drug combination, or one medical product |
| 2 | The modifiers indicating a change of drug use or a patient's drug use duration should not be included in the annotated drug name |
| 3 | A drug name entity is annotated as one phrase. For instance, the drug “nifedipine GITS” is usually annotated as two phrases, nifedipine and GITS, whereas here the whole drug name is annotated as a single drug name entity |
| 4 | For TCM, Chinese characters indicating drug forms, such as “丸” (pill), “粉” (powder), and “汤” (decoction), cannot be annotated as single characters, because they are usually the last characters of certain TCM drug names; such drug names should be annotated as a whole. For example, in the TCM drug name “大青龙汤” (da qing long decoction), “da qing long” is the pinyin of the Chinese characters “大青龙,” while the Chinese character “汤” is a drug form meaning “decoction” |
| 5 | The explicit negative modifiers around the drug names are not included in the annotated drug name entity |
| 6 | When Chinese drug name and the corresponding English name coexist in one short description without other words between them, they are jointly annotated as one drug name entity |
| 7 | When Chinese drug name and the corresponding English name coexist in one short description with simple symbols such as “/” or “-” between them, they are jointly annotated as one drug name entity |
| 8 | We also have seen parallel construction or ellipsis construction in some drug names. If two drug names are connected by one conjunction, the two drug names should be annotated as two separate drug entities |
| 9 | In some situations, certain words and punctuations in a drug name entity are ignored. Then, the following rules are used: |
| 10 | If two or more valid drug names end with the same characters and are combined together, then, the last drug name with the ending characters is taken as one complete drug name. For instance, in a description of two drug names ofloxacin and vitamin C injection, vitamin C injection is recognized as one complete drug name entity |
| 11 | Drug names usually contain figures, letters, and other symbols. Since these symbols represent drug-related information (e.g., (), <>), they are included in one drug name entity |
| 12 | When the TCM name and the description of the producing area coexist in the drug name, the information of the producing area is ignored. For instance, in Chuan Bei Mu (Zhejiang), Zhejiang is ignored |
| 13 | Specification may follow a drug name that does not belong to a drug name entity and may not need separate annotation. For example, in the drug name Cold Clear Capsule (a capsule with 24 mg of paracetamol), the specification is in brackets |
| 14 | A maximum annotation length for drug name entities should be set and followed, except when such a limit would break the validity of the grammatical structure. In particular, when modifiers of a drug name carry special information about a brand or pattern and form an agglutinated structure within the drug name, these modifiers should be included in the drug name entity. For example, in the drug name “苗泰小儿柴桂退烧颗粒” (pinyin: “miao tai xiao er chai gui tui shao ke li”), “苗泰” (pinyin: “miao tai”) is a drug brand and should not be excluded |
Dataset scales used in this study.
| Dataset name | Number of ANs | Number of sentences | Number of sentences mentioning drug name | Number of annotated WM drug name entities | Number of annotated TCM drug name entities | Number of annotated drug name entities |
|---|---|---|---|---|---|---|
| Training set | 648 | 40,649 | 1665 | 1322 | 487 | 1809 |
| Test set | 324 | 20,397 | 716 | 581 | 209 | 790 |
| Total | 972 | 61,046 | 2381 | 1903 | 696 | 2599 |
Figure 1: High-level architecture for the CCMER.
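The cascade described in the abstract (a sentence-level filter followed by entity recognition on the retained sentences) can be sketched as below. This is a minimal illustration, not the authors' implementation: `cascade_recognize`, `toy_sentence_filter`, `toy_tagger`, and `TOY_LEXICON` are hypothetical names, and the two toy functions merely stand in for the SVM classifier and the CRF recognizer.

```python
from typing import Callable, List, Tuple

# (start, end, type) with an exclusive end offset; the type is "TCM" or "WM".
Entity = Tuple[int, int, str]


def cascade_recognize(
    sentences: List[str],
    is_hot_sentence: Callable[[str], bool],
    tag_entities: Callable[[str], List[Entity]],
) -> List[List[Entity]]:
    """Run the tagger only on sentences that the filter keeps."""
    results = []
    for sentence in sentences:
        if is_hot_sentence(sentence):
            results.append(tag_entities(sentence))
        else:
            # Filtered-out sentences never reach the tagger, so they
            # cannot contribute false-positive entities.
            results.append([])
    return results


# Toy stand-ins (assumptions for illustration only).
TOY_LEXICON = {"大青龙汤": "TCM", "nifedipine GITS": "WM"}


def toy_sentence_filter(sentence: str) -> bool:
    """Stand-in for the SVM: keep sentences containing a simple cue."""
    return any(cue in sentence for cue in ("汤", "丸", "mg"))


def toy_tagger(sentence: str) -> List[Entity]:
    """Stand-in for the CRF: exact lookup of lexicon entries."""
    found = []
    for name, kind in TOY_LEXICON.items():
        start = sentence.find(name)
        if start != -1:
            found.append((start, start + len(name), kind))
    return sorted(found)


tagged = cascade_recognize(
    ["予大青龙汤口服", "患者无发热"], toy_sentence_filter, toy_tagger
)
```

Passing the two stages in as callables keeps the cascade logic independent of whichever classifier and sequence labeler are actually used.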
List of various features for the drug name recognizer.
| Feature set | Features | Description |
|---|---|---|
| F1-1 | CWS = 1: | The 1-gram, 2-gram, and 3-gram of the character text at CWS = 1 |
| F1-2 | CWS = 2: | The 1-gram, 2-gram, and 3-gram of the character text at CWS = 2 |
| F1-3 | CWS = 3: | The 1-gram, 2-gram, and 3-gram of the character text at CWS = 3 |
| F1-4 | CWS = 1: | The 1-gram, 2-gram, and 3-gram of the pinyin corresponding to the current character at CWS = 1 |
| F1-5 | CWS = 2: | The 1-gram, 2-gram, and 3-gram of the pinyin corresponding to the current character at CWS = 2 |
| F1-6 | CWS = 3: | The 1-gram, 2-gram, and 3-gram of the pinyin corresponding to the current character at CWS = 3 |
| F2-1 | InDictTCM | Are the current character and the surrounding characters contained in the TCM dictionary? |
| F2-2 | InDictTCMPinyin | Are the pinyins corresponding to the current character and the surrounding characters contained in the TCM dictionary? |
| F2-3 | InDictWM | Are the current character and the surrounding characters contained in the WM dictionary? |
| F2-4 | InDictWMPinyin | Are the pinyins corresponding to the current character and the surrounding characters contained in the WM dictionary? |
| F3-1 | CurC | Do the current character and subsequent characters contain the TCM dosage unit |
| F3-2 | CurC | Do the current character and subsequent characters contain the WM dosage unit |
| F3-3 | PreC | Do the characters before the current character contain the TCM dosage unit |
| F3-4 | PreC | Do the characters before the current character contain the WM dosage unit |
| F3-5 | CurC | Do the current character and subsequent characters contain the TCM usage term |
| F3-6 | CurC | Do the current character and subsequent characters contain the WM usage term |
| F3-7 | PreC | Do the characters before the current character contain the TCM usage term |
| F3-8 | PreC | Do the characters before the current character contain the WM usage term |
| F3-9 | CurC | Do the current character and subsequent characters contain the TCM drug form unit |
| F3-10 | CurC | Do the current character and subsequent characters contain the WM drug form unit |
| F3-11 | PreC | Do the characters before the current character contain the TCM drug form unit |
| F3-12 | PreC | Do the characters before the current character contain the WM drug form unit |
| F3-13 | CurC | Do the current character and subsequent characters contain the TCM frequency description |
| F3-14 | CurC | Do the current character and subsequent characters contain the WM frequency description |
| F3-15 | PreC | Do the characters before the current character contain the TCM frequency description |
| F3-16 | PreC | Do the characters before the current character contain the WM frequency description |
| F4-1 | HasNum9 | Do the current character and the surrounding characters include the figure “9”? |
| F4-2 | HasToken@ | Do the current character and the surrounding characters include the symbol “@”? |
| F4-3 | HasEnglishAlphabets | Do the current character and the surrounding characters include English letters? |
| F4-4 | HasTime | Do the current character and the surrounding characters contain time description such as hour, week, date, or year? |
| F5 | InListSectionName | Does the name of the AN section containing the current character and the surrounding characters appear in the predefined section list? |
| F6 | Class | These three features indicate the class labels of the 3 characters before the current character |
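As an illustration of feature sets F1-1 to F1-3, the sketch below collects the character 1-grams, 2-grams, and 3-grams that lie inside a context window around the current character. It assumes that CWS = k means k characters on each side of the current position; the source does not spell this out, and the function name, feature keys, and `<PAD>` token are hypothetical.

```python
from typing import Dict, List


def char_ngram_features(chars: List[str], i: int, cws: int, max_n: int = 3) -> Dict[str, str]:
    """Collect n-grams (n = 1..max_n) lying fully inside [i - cws, i + cws].

    Feature keys encode the n-gram order and the start offset relative to
    the current character; positions outside the sentence are padded.
    """

    def char_at(j: int) -> str:
        return chars[j] if 0 <= j < len(chars) else "<PAD>"

    features = {}
    lo, hi = i - cws, i + cws
    for n in range(1, max_n + 1):
        # The last valid start keeps the whole n-gram inside the window.
        for start in range(lo, hi - n + 2):
            gram = "".join(char_at(j) for j in range(start, start + n))
            features[f"{n}gram[{start - i}]"] = gram
    return features


example = char_ngram_features(list("大青龙汤"), i=1, cws=1)
```

The same function could be applied to a parallel list of pinyin syllables to obtain analogues of F1-4 to F1-6.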
Rules used in the postprocessing module.
| Number | Description of postprocessing rules |
|---|---|
| 1 | If the label “O” is followed by the label “I,” then, “I” is forcefully resolved to the same-type label “B” |
| 2 | If “B” is followed by a different-type label “I,” then, “I” is forcefully resolved to “B,” such as B-WM I-TCM ➔ B-WM B-WM |
| 3 | In Chinese ANs, the end of a drug name is rarely followed immediately by a completely different therapeutic drug. Adjacent entities of the same type are therefore merged, such as B-WM I-WM B-WM I-WM ➔ B-WM I-WM I-WM I-WM |
| 4 | If a drug name entity contains “)” but not “(,” the starting position of the entity is moved back to the position of “(,” and the label “B” is repositioned there |
| 5 | If “)” is annotated with the label “O” and immediately follows the last Chinese character of the recognized drug name, the end of the field is extended by one character to include “)”; otherwise, the starting position of the drug name field is adjusted to discard “(” |
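Rules 1 and 2 of the postprocessing table amount to a single left-to-right repair pass over the predicted label sequence. The sketch below covers only those two rules and assumes labels of the form `O`, `B-TCM`, `I-TCM`, `B-WM`, `I-WM`; treating the start of a sequence as if it were preceded by `O` is an additional assumption, and `repair_bio` is a hypothetical name.

```python
from typing import List


def repair_bio(labels: List[str]) -> List[str]:
    """Apply postprocessing rules 1 and 2 to a BIO label sequence."""
    fixed = []
    prev = "O"  # assumption: the sequence start behaves like a preceding "O"
    for label in labels:
        if label.startswith("I-"):
            kind = label[2:]
            if prev == "O":
                # Rule 1: an "I" right after "O" becomes a same-type "B".
                label = "B-" + kind
            elif prev.startswith("B-") and prev[2:] != kind:
                # Rule 2: an "I" of a different type after "B" is resolved
                # as in the table's example: B-WM I-TCM -> B-WM B-WM.
                label = prev
        fixed.append(label)
        prev = label
    return fixed


rule1 = repair_bio(["O", "I-TCM", "I-TCM"])
rule2 = repair_bio(["B-WM", "I-TCM"])
```

Rules 3 to 5 depend on the surrounding characters and entity boundaries, so they are left out of this sketch.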
Rules used in evaluation.
| Score | Rule description |
|---|---|
| 1 | Medication entity is accurately detected, and divisions of class and boundary are both correct |
| 0.8 | Only one error is detected at the start position of the ME boundary |
| 0.6 | Only one error is detected at the end position of the ME boundary |
| 0.4 | Two errors are detected at the start and end positions of ME boundaries, respectively |
| 0 | ME is not detected, or the detected phrase is not a drug name entity annotated in the gold standard |
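The graded scores in the table above can be expressed as a small function over entity spans. This is a sketch under stated assumptions: spans are `(start, end, type)` tuples with an exclusive end, a type mismatch scores 0, and a prediction must overlap the gold span to count as a detection of it. None of these details are given in the source, and `score_prediction` is a hypothetical name.

```python
from typing import Optional, Tuple

Span = Tuple[int, int, str]  # (start, end, type), end exclusive


def score_prediction(gold: Span, pred: Optional[Span]) -> float:
    """Score one predicted entity against one gold entity."""
    if pred is None:
        return 0.0  # the entity was not detected
    g_start, g_end, g_type = gold
    p_start, p_end, p_type = pred
    if p_type != g_type:
        return 0.0  # assumption: a wrong class earns no credit
    if p_end <= g_start or p_start >= g_end:
        return 0.0  # no overlap: not the annotated entity
    start_ok = p_start == g_start
    end_ok = p_end == g_end
    if start_ok and end_ok:
        return 1.0  # class and both boundaries correct
    if end_ok:
        return 0.8  # only the start boundary is wrong
    if start_ok:
        return 0.6  # only the end boundary is wrong
    return 0.4      # both boundaries are wrong


gold_entity = (1, 5, "TCM")
```

For example, `score_prediction(gold_entity, (2, 5, "TCM"))` falls into the 0.8 case because only the start boundary differs.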
Performance of the baseline system 1 based on professional drug dictionaries and the maximum matching algorithm between drug name characters and pinyin.
| | Precision | Recall | F-measure |
|---|---|---|---|
| TCM drug name | 49.2% | 45.5% | 47.3% |
| WM drug name | 54.9% | 49.1% | 51.8% |
| All drug names | 53.2% | 48.0% | 50.5% |
Note: to the nearest 0.1%.
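Baseline system 1 relies on dictionary lookup with maximum matching. A minimal sketch of forward maximum matching over characters is given below; it is only one common variant, the pinyin matching mentioned in the caption is not covered, and the function name and toy lexicon are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def forward_max_match(
    text: str, lexicon: Iterable[str], max_len: Optional[int] = None
) -> List[Tuple[int, int, str]]:
    """Greedily match the longest lexicon entry at each position.

    Returns (start, end, matched_text) tuples with an exclusive end.
    """
    entries = set(lexicon)
    if max_len is None:
        max_len = max((len(entry) for entry in entries), default=0)
    matches = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in entries:
                matches.append((i, i + length, candidate))
                i += length
                break
        else:
            i += 1  # nothing matched here; move on by one character
    return matches


toy_matches = forward_max_match("予大青龙汤口服", {"大青龙", "大青龙汤"})
```

Trying the longest candidate first means that “大青龙汤” is returned as a whole rather than the shorter entry “大青龙”, in line with annotation rule 4.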
Confusion matrix of outputs from the filtration module of potential hot-sentence classification.
| Classification | Medication | No medication | Total |
|---|---|---|---|
| Medication | 695 | 21 | 716 |
| No medication | 57 | 19,624 | 19,681 |
| Total | 752 | 19,645 | 20,397 |
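The diagonal cells of this confusion matrix follow from the row totals and the two off-diagonal counts (21 and 57), and the result can be checked against the column totals. The sketch below also derives accuracy, recall, and precision for the medication class, assuming that rows hold the gold labels and columns the classifier output. The row total of 716 matching the number of test sentences that mention a drug name suggests this, but the source does not state it; `filter_metrics` and the dictionary keys are hypothetical names.

```python
def filter_metrics(med_total, nomed_total, med_as_nomed, nomed_as_med):
    """Derive the diagonal cells and summary metrics of a 2x2 matrix.

    Assumes rows are gold labels and columns are predictions.
    """
    med_as_med = med_total - med_as_nomed
    nomed_as_nomed = nomed_total - nomed_as_med
    total = med_total + nomed_total
    return {
        "med_as_med": med_as_med,
        "nomed_as_nomed": nomed_as_nomed,
        "accuracy": (med_as_med + nomed_as_nomed) / total,
        "recall": med_as_med / med_total,
        "precision": med_as_med / (med_as_med + nomed_as_med),
    }


metrics = filter_metrics(
    med_total=716, nomed_total=19681, med_as_nomed=21, nomed_as_med=57
)

# Consistency check against the column totals reported in the table.
assert metrics["med_as_med"] + 57 == 752
assert 21 + metrics["nomed_as_nomed"] == 19645
```

If rows and columns were instead the other way round, the recall and precision values would swap, while accuracy would stay the same.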
Figure 2: Precision, recall, and F-measure obtained by CRF with different features under different CWS settings.
Figure 3: Precision, recall, and F-measure obtained by CRF with different features under the CWS = 3 setting.
Figure 4: Precision, recall, and F-measure obtained by the baseline systems versus the CCMER system.
Performance of the baseline system 3, based on CCMER without hot-sentence detection (feature sets: F1 + F3 + F4 + F5 + F6).
| | Precision | Recall | F-measure |
|---|---|---|---|
| TCM drug name | 73.4% | 71.3% | 72.3% |
| WM drug name | 70.2% | 72.3% | 71.4% |
| All drug names | 71.0% | 72.3% | 71.6% |
Note: to the nearest 0.1%.
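For reference, the balanced F-measure is the harmonic mean of precision and recall. The sketch below assumes this standard F1 definition, which the source does not state explicitly, and checks it against the TCM row of the table above. Not every reported row need reproduce exactly from the rounded precision and recall values, since the evaluation uses graded scores.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# TCM drug names, baseline system 3: precision 73.4%, recall 71.3%.
tcm_f = round(f_measure(73.4, 71.3), 1)
```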