Wutao Lin, Donghong Ji, Yanan Lu.
Abstract
BACKGROUND: Information extraction from clinical texts helps medical workers identify patients' problems faster and makes intelligent diagnosis possible in the future. Much work has addressed disorder mention recognition in clinical narratives, but recognizing more complicated disorder mentions, such as overlapping ones, remains an open issue. This paper proposes a method for disorder mention recognition based on a multi-label structured Support Vector Machine (SVM). We present a multi-label scheme that could be used in complicated entity recognition tasks.
Keywords: Clinical text; Information extraction; Multi-label; Structured support vector machine
Year: 2017 PMID: 28143488 PMCID: PMC5282630 DOI: 10.1186/s12859-017-1476-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Examples of disorder mentions
Fig. 2 The flow of our model
Design of the multi-labels
| Type | Forms of the multi-labels |
|---|---|
| Class | 000000,000000,000000, |
| Class | 000000,000000, |
| Class | 000000, |
| Class | |
| Class | |
| Class | 000000,000000,000000,000000 |
To make the 24-bit label easier to read, extra commas split it into four 6-bit groups.
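The comma convention can be illustrated with a minimal Python sketch. It only handles the presentation of a 24-bit label as four 6-bit groups; the meaning of individual bits is not assumed here. The helper names `format_label` and `parse_label` are hypothetical and do not come from the paper.

```python
def format_label(bits: str, group_size: int = 6) -> str:
    """Insert commas between fixed-size groups of a bit-string label."""
    if set(bits) - {"0", "1"}:
        raise ValueError("label must contain only '0' and '1'")
    if len(bits) % group_size != 0:
        raise ValueError("label length must be a multiple of group_size")
    groups = [bits[i:i + group_size] for i in range(0, len(bits), group_size)]
    return ",".join(groups)


def parse_label(text: str) -> str:
    """Strip the readability commas to recover the raw bit string."""
    return text.replace(",", "")


# Label of the token "Abdomen" from the table of final labels.
raw = parse_label("000000,000000,000011,000000")
print(len(raw))           # 24
print(format_label(raw))  # 000000,000000,000011,000000
```

Keeping the raw bit string and the comma-separated display form separate means the commas never affect label comparison during training or evaluation.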
Fig. 3 Examples of the sub-labels
Fig. 4 Example of Algorithm 1
Examples of the final labels
| Token | Final label |
|---|---|
| Abdomen | 000000,000000,000011,000000 |
| is | 000000,000000,000000,000000 |
| soft | 000000,000000,000000,000000 |
| , | 000000,000000,000000,000000 |
| nontender | 000000,000010,000000,000000 |
| , | 000000,000000,000000,000000 |
| nondistended | 000000,000000,000000,000100 |
| , | 000000,000000,000000,000000 |
| negative | 000000,000000,000000,000000 |
| bruits | 000000,000001,000000,000000 |
Fig. 5 Example of Algorithm 2
Feature set description
| Feature | Description |
|---|---|
| Bag of Words | Bag of Words in a 5-word window. |
| Part of speech | Part-of-speech tags in a 7-word window. |
| Capitalization | Convert all alphabetic characters of the words to uppercase […] |
| Case pattern | The patterns are generated by the following steps. Similar to […] |
| Word representation | We use word2vec to acquire 700 clusters from the unlabeled clinical narratives and give each cluster a different serial number. Then we take the serial number of the clusters as a feature. The window size is 3. |
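The windowed features in the table can be sketched as follows. This is a minimal illustration, assuming that each window is centred on the current token and that positions outside the sentence are padded; the table does not state either detail. The function names, the `<PAD>` symbol, the feature-string format, and the hand-assigned POS tags are all hypothetical.

```python
from typing import List, Set


def window(items: List[str], index: int, size: int, pad: str = "<PAD>") -> List[str]:
    """Return `size` items centred on `index`, padding beyond the sentence edges."""
    half = size // 2
    return [
        items[i] if 0 <= i < len(items) else pad
        for i in range(index - half, index + half + 1)
    ]


def token_features(tokens: List[str], pos_tags: List[str], index: int) -> Set[str]:
    """Build string features for one token from the two windows."""
    # Bag of words: unordered, lower-cased words in a 5-word window.
    feats = {f"bow={w.lower()}" for w in window(tokens, index, 5)}
    # Part of speech: position-tagged POS tags in a 7-word window.
    for k, tag in enumerate(window(pos_tags, index, 7)):
        feats.add(f"pos[{k - 3:+d}]={tag}")
    return feats


tokens = ["Abdomen", "is", "soft", ",", "nontender"]
pos_tags = ["NN", "VBZ", "JJ", ",", "JJ"]  # hand-assigned for illustration only
print(sorted(token_features(tokens, pos_tags, 0)))
```

The word-representation feature could reuse `window` with a size of 3 over cluster serial numbers once a token-to-cluster mapping is available.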
Statistics for three types of disorder mentions
| Disorder type | Amount | Percentage |
|---|---|---|
| Contiguous | 9867 | 88.45% |
| Discontiguous | 565 | 5.06% |
| Overlapping | 724 | 6.49% |
| Total | 11156 | 100.00% |
Statistics for discontiguous disorder mentions
| Disorder type | Amount | Percentage |
|---|---|---|
| 1 breakpoint | 1027 | 94.31% |
| 2 breakpoints | 62 | 5.69% |
| 3 or more breakpoints | 0 | 0.00% |
| Total | 1089 | 100.00% |
Statistics for overlapping disorder mentions
| Disorder type | Amount | Percentage |
|---|---|---|
| 2 disorder mentions overlap with each other | 482 | 66.57% |
| 3 disorder mentions overlap with each other | 198 | 27.35% |
| 4 disorder mentions overlap with each other | 28 | 3.87% |
| 5 disorder mentions overlap with each other | 10 | 1.38% |
| 6 disorder mentions overlap with each other | 6 | 0.83% |
| 7 or more disorder mentions overlap with each other | 0 | 0.00% |
| Total | 724 | 100.00% |
Disorder mentions with different span lengths
| Span length | Disorder amount | Percentage |
|---|---|---|
| 1 | 5172 | 46.36% |
| 2 | 3158 | 28.31% |
| 3 | 1580 | 14.16% |
| 4 | 474 | 4.25% |
| 5 | 340 | 3.05% |
| 6 or more | 432 | 3.87% |
| Total | 11156 | 100.00% |
Results for multi-label SSVM model with different feature sets
| Feature set | Precision | Recall | F1-Score |
|---|---|---|---|
| SSVM + BOW | 0.7626 | 0.3329 | 0.4635 |
| SSVM + BOW + POS | 0.7953 | 0.3857 | 0.5195 |
| SSVM + BOW + POS + capitalization | 0.8417 | 0.5702 | 0.6799 |
| SSVM + BOW + POS + capitalization + case pattern | 0.8398 | 0.5839 | 0.6889 |
| SSVM + BOW + POS + capitalization + case pattern + word representation | 0.8244 | 0.6620 | 0.7343 |
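The F1 column is consistent with the standard harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Full feature set: P = 0.8244, R = 0.6620
print(round(f1_score(0.8244, 0.6620), 4))  # 0.7343
# Bag of words only: P = 0.7626, R = 0.3329
print(round(f1_score(0.7626, 0.3329), 4))  # 0.4635
```

Because F1 is a harmonic mean, the low recall of the smaller feature sets pulls the score down even though their precision is already high.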
Results for different evaluation modes
| Mode | Precision | Recall | F1-Score |
|---|---|---|---|
| Strict | 0.8244 | 0.6620 | 0.7343 |
| Relaxed (left match) | 0.8229 | 0.6826 | 0.7462 |
| Relaxed (right match) | 0.8441 | 0.6995 | 0.7650 |
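The three evaluation modes can be sketched as below. This record does not define them, so the sketch assumes that strict means both boundaries agree, left match means the start offsets agree, and right match means the end offsets agree. Mentions are simplified to contiguous `(start, end)` spans, which does not cover discontiguous mentions. All names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets; a simplifying assumption


def is_match(gold: Span, pred: Span, mode: str) -> bool:
    """Compare two spans under the assumed definition of each mode."""
    if mode == "strict":
        return gold == pred
    if mode == "left":
        return gold[0] == pred[0]
    if mode == "right":
        return gold[1] == pred[1]
    raise ValueError(f"unknown mode: {mode}")


def precision_recall(gold: List[Span], pred: List[Span], mode: str = "strict"):
    """Precision over predicted spans and recall over gold spans."""
    hit_pred = sum(any(is_match(g, p, mode) for g in gold) for p in pred)
    hit_gold = sum(any(is_match(g, p, mode) for p in pred) for g in gold)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    return precision, recall


gold = [(0, 2), (5, 6)]
pred = [(0, 1), (5, 6), (8, 9)]
for mode in ("strict", "left", "right"):
    print(mode, precision_recall(gold, pred, mode))
```

In this toy example the prediction `(0, 1)` counts only under left match, which shows how a relaxed mode can raise both precision and recall over the strict one.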
Results for different types of disorder mentions
| Type | Precision | Recall | F1-Score |
|---|---|---|---|
| Contiguous | 0.8262 | 0.7036 | 0.7600 |
| Discontiguous | 0.6914 | 0.3060 | 0.4242 |
| Overlapping | 0.8632 | 0.2832 | 0.4265 |
Fig. 6 Comparison between SSVM and CRF model
Fig. 7 Comparison among BIOHD, BIOHD1234 and our multi-label scheme
Comparison of the SSVM model with different feature sets
| Features | Precision | Recall | F1-Score |
|---|---|---|---|
| Our features | 0.6560 | 0.5875 | 0.6199 |
| Tang’s features | 0.842 | 0.722 | 0.777 |
Fig. 8 Results for disorder mentions with different span lengths