Małgorzata Marciniak, Agnieszka Mykowiecka.
Abstract
BACKGROUND: Hospital documents contain free text describing the most important facts about patients and their illnesses. They are written in a specialized language rich in medical terminology related to hospital treatment. Processing them automatically can help verify the consistency of hospital documentation and gather statistical data. This task requires knowing which phrases to look for, but clinical resources for Polish are currently sparse, and existing terminologies such as the Polish Medical Subject Headings (MeSH) do not provide sufficient coverage for clinical tasks. It would therefore be helpful to prepare automatically, from a data sample, an initial set of terms that could be used for information extraction after manual verification.
Year: 2014 PMID: 24976943 PMCID: PMC4062289 DOI: 10.1186/2041-1480-5-24
Source DB: PubMed Journal: J Biomed Semantics
Distribution of phrase lengths
| ∑ | 4156 | 11354 | 14156 | 1354 | 32.58 |
| 1 | 1381 | 2219 | 2880 | 720 | 52.14 |
| 2 | 1644 | 4212 | 5403 | 453 | 27.55 |
| 3 | 801 | 2941 | 3605 | 137 | 17.10 |
| 4 | 242 | 1301 | 1511 | 32 | 13.22 |
| 5 | 68 | 476 | 534 | 10 | 14.71 |
| > 5 | 20 | 205 | 223 | 2 | 10.00 |
| Max | 12(8) | 5(7) | 12(8) | 0 | - |
Distribution of phrase frequencies
| ∑ | 4156 | 11354 | 14156 |
| =1 | 2272 | 7120 | 8211 |
| 2–10 | 1417 | 4076 | 4572 |
| 11–50 | 325 | 922 | 969 |
| 51–100 | 71 | 115 | 157 |
| 101–1000 | 71 | 168 | 217 |
| > 1000 | 0 | 28 | 30 |
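A distribution of this kind can be tallied from a phrase-to-frequency mapping. The following is only a minimal sketch: the bin boundaries are read off the row labels of the table above (with the last bin taken to mean "above 1000"), while the names `BINS` and `frequency_distribution` and the sample data are hypothetical and not taken from the paper.

```python
# Bin boundaries copied from the row labels of the frequency table;
# None marks an open upper bound. All names here are illustrative.
BINS = [
    ("=1", 1, 1),
    ("2-10", 2, 10),
    ("11-50", 11, 50),
    ("51-100", 51, 100),
    ("101-1000", 101, 1000),
    (">1000", 1001, None),
]


def frequency_distribution(phrase_counts):
    """Count how many distinct phrases fall into each frequency bin."""
    dist = {label: 0 for label, _, _ in BINS}
    for freq in phrase_counts.values():
        for label, low, high in BINS:
            if freq >= low and (high is None or freq <= high):
                dist[label] += 1
                break  # bins do not overlap, so stop at the first match
    return dist


# Made-up sample data, purely for illustration.
sample = {"a": 1, "b": 1, "c": 7, "d": 1000, "e": 1001}
result = frequency_distribution(sample)
```

Each column of the table would correspond to one such tally over a different phrase set.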
Standard C-value distribution
| ∑ | 4156 | 11354 | 14156 |
| C = 0 | 1110 | 3458 | 4163 |
| C > 0 | 3046 | 7896 | 9993 |
| 0 < C < 1 | 893 | 1509 | 1936 |
| C = 1 | 565 | 1301 | 1708 |
| C > 1 | 1588 | 5086 | 6349 |
| 1 < C ≤ 2.5 | 898 | 2842 | 3531 |
| C > 2.5 | 690 | 2244 | 2818 |
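For orientation, the commonly cited C-value definition (Frantzi, Ananiadou and Mima) scores a candidate phrase by its frequency, discounted by the average frequency of the longer candidates that contain it, and weighted by the base-2 logarithm of its length in words. The sketch below follows that textbook formula only. Whether the figures in the table above were produced with exactly this variant is not stated in this section, and the function names and sample phrases are hypothetical.

```python
import math


def is_nested(short, long):
    """True if `short` occurs as a contiguous word sequence inside the longer `long`."""
    n = len(short)
    return len(long) > n and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def c_values(freq):
    """Textbook C-value for each candidate.

    `freq` maps a phrase (tuple of words) to its corpus frequency.
    C(a) = log2(|a|) * f(a)                          if a is not nested
    C(a) = log2(|a|) * (f(a) - mean f(b), b in T_a)  otherwise,
    where T_a is the set of longer candidates containing a.
    """
    scores = {}
    for a, f_a in freq.items():
        containing = [f_b for b, f_b in freq.items() if is_nested(a, b)]
        if containing:
            adjusted = f_a - sum(containing) / len(containing)
        else:
            adjusted = f_a
        scores[a] = math.log2(len(a)) * adjusted
    return scores


# Made-up frequencies, purely for illustration.
sample = {
    ("soft", "contact", "lens"): 3,
    ("contact", "lens"): 5,
    ("lens",): 10,
}
scores = c_values(sample)
```

Under this unmodified definition every one-word phrase scores 0 because log2(1) = 0, which is one reason adapted variants change the length factor.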
C1-value distribution
| ∑ | 4156 | 11354 | 14156 |
| C = 0 | 2843 | 4140 | 4933 |
| C > 0 | 2843 | 7214 | 9223 |
| 0 < C < 1 | 775 | 1243 | 1625 |
| C = 1 | 581 | 1339 | 1757 |
| 1 < C ≤ 2.5 | 843 | 1487 | 3227 |
| C > 2.5 | 644 | 2068 | 2614 |
Comparison with general corpus
| Common with NKJP | 791 | 1155 |
| 1-word | 680 | 969 |
| Multi words | 111 | 186 |
| C1-value greater in NKJP | 431 | 546 |
| 1-word | 374 | 477 |
| Multi words | 57 | 69 |
Top 20 phrases in data
| 185.60 | 116 | 0 | information card’ |
| 124.00 | 155 | 4 | |
| 114.04 | 118 | 27 | |
| 107.82 | 122 | 17 | |
| 102.66 | 75 | 62 | |
| 102.17 | 55 | 0 | |
| 93.60 | 117 | 0 | |
| 93.60 | 117 | 0 | |
| 92.80 | 116 | 0 | |
| 92.14 | 66 | 10 | |
| 91.28 | 114 | 0 | |
| 91.28 | 114 | 0 | |
| 79.51 | 93 | 9 | |
| 78.14 | 52 | 12 | |
| 74.81 | 59 | 0 | |
| 73.54 | 58 | 0 | |
| 69.35 | 4 | 59 | |
| 62.56 | 42 | 11 | |
| 58.80 | 1 | 87 | |
| 55.20 | 35 | 665 | |
Top 20 phrases in surgical data
| 1862.40 | 1164 | 1 | information card’ |
| 1332.80 | 833 | 0 | |
| 1030.95 | 1170 | 112 | |
| 964.56 | 1167 | 43 | |
| 943.26 | 1179 | 3 | |
| 931.20 | 1164 | 0 | |
| 924.80 | 1156 | 0 | |
| 735.22 | 919 | 1 | |
| 678.09 | 124 | 317 | |
| 662.48 | 325 | 525 | |
| 609.60 | 762 | 1 | |
| 526.40 | 414 | 0 | |
| 520.80 | 649 | 4 | |
| 511.34 | 377 | 51 | |
| 508.30 | 67 | 267 | |
| 470.00 | 1 | 1173 | |
| 468.70 | 238 | 14 | |
| 466.40 | 0 | 1166 | |
| 430.81 | 222 | 422 | |
| 410.84 | 324 | 1 | |
Phrases in texts
| nb of phrases | 241 | 235 | 208 |
| nb of extr. phr. | 199 | 190 | 175 |
| % of extr. phr. | 82.5 | 80.0 | 84.1 |
Phrases in surgery texts
| nb of phrases | 163 | 164 | 138 |
| nb of extr. phr. | 134 | 136 | 116 |
| % of extr. phr. | 82.2 | 82.9 | 84.0 |
Phrases considered as terms in documents
| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| top200 | 176 | 88 | 19 | 9.5 | 5 | 2.5 | 178 | 89 | 14 | 7 | 8 | 4 |
| middle100 | 88 | 88 | 5 | 5.0 | 7 | 7.0 | 83 | 83 | 8 | 8 | 9 | 9 |
| end100 | 75 | 75 | 18 | 18.0 | 7 | 7.0 | 82 | 82 | 10 | 10 | 8 | 8 |
Phrases considered as terms in surgery documents
| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| top400 | 353 | 88.3 | 28 | 7.0 | 19 | 4.7 | 348 | 87.0 | 27 | 6.7 | 25 | 6.3 |
| middle200 | 136 | 68.0 | 11 | 5.5 | 43 | 21.5 | 145 | 72.5 | 14 | 7.0 | 41 | 20.5 |
| end200 | 127 | 63.5 | 33 | 16.5 | 40 | 20.0 | 121 | 60.5 | 35 | 17.5 | 44 | 22.0 |
Comparison of the results for different grammars for surgery documents
| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| top400 | 353 | 88.3 | 28 | 7.0 | 19 | 4.7 | 350 | 87.5 | 19 | 4.75 | 31 | 7.75 |
| next400 | 331 | 82.8 | 19 | 4.75 | 50 | 12.5 | 310 | 77.5 | 15 | 3.75 | 75 | 18.75 |
The sets of rules for recognizing noun phrases
| I | N subst | ger |
| | NC (foreign_subst | foreign) +foreign?+foreign? |
| | NC brev |
| | NC brev |
| | NC brev |
| | NC brev |
| | AJ ^2 adv?+(adj |
| | AC brev |
| | CN “i” |
| II | A AJ+adv? |
| | A ^3 AC + “-” + AJ |
| | A ^3 adja + “-” + AJ |
| | AC AC + “-”+AC |
| | N N |
| | NZ subst(lemma=to/co/obrąb/kierunek/cel/czas/możliwość/podstawa/ciąg/cecha/...) |
| | AZ IR(lemma=aktualny/daleki/gdy/pewien/wzgląd/ten/inny/sam/niektóry/wczesny/...) |
| III | ADJP A |
| | ADJP A |
| | ADJP A |
| IV | NB ^2 NC+ADJP |
| | NB ^2 AC+N |
| | NB N+AC |
| | NB ADJP |
| | NB ADJP |
| V | NG NB |
| | NG NB |
| VI | X NG+NG |
| | X NG+NC |