Markus Kreuzthaler, Stefan Schulz.
Abstract
BACKGROUND: In Western languages the period character is highly ambiguous, due to its double role as sentence delimiter and abbreviation marker. This is particularly relevant in clinical free texts, which are characterized by numerous anomalies in spelling, punctuation, and vocabulary and by a high frequency of short forms.
Year: 2015 PMID: 26099994 PMCID: PMC4474545 DOI: 10.1186/1472-6947-15-S2-S4
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Corpus-based frequency counts (C) required for log λ calculation.
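The caption names the counts but the formula itself is not preserved here. As a rough sketch, assuming log λ refers to Dunning's log-likelihood ratio for the collocation of a word with a following period (a common choice in unsupervised abbreviation detection, not confirmed by this page), the score could be computed from four counts. The function and parameter names (`log_lambda`, `c_w`, `c_period`, `c_w_period`, `n`) are hypothetical.

```python
import math


def log_l(k, n, x):
    """Log binomial likelihood log(x^k * (1 - x)^(n - k)); 0 * log(0) is taken as 0."""
    def xlogy(a, b):
        return 0.0 if a == 0 else a * math.log(b)
    return xlogy(k, x) + xlogy(n - k, 1.0 - x)


def log_lambda(c_w, c_period, c_w_period, n):
    """Dunning-style log-likelihood score (-2 log lambda), assumed here, not taken from the source.

    c_w        : corpus count of the word
    c_period   : corpus count of the period
    c_w_period : count of the word directly followed by a period
    n          : total number of tokens
    """
    p = c_period / n                             # H0: period is independent of the word
    p1 = c_w_period / c_w                        # HA: P(period | word)
    p2 = (c_period - c_w_period) / (n - c_w)     # HA: P(period | other word)
    log_ratio = (
        log_l(c_w_period, c_w, p)
        + log_l(c_period - c_w_period, n - c_w, p)
        - log_l(c_w_period, c_w, p1)
        - log_l(c_period - c_w_period, n - c_w, p2)
    )
    return -2.0 * log_ratio
```

A word that is always followed by a period scores high under this sketch, while a word whose period rate matches the corpus average scores zero.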
Abbreviation detection.
| Method | BL | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| micro-avg. F1 | 0.62 | 0.60*' | 0.73*' | 0.83*' | 0.83* | 0.83* | 0.93*' |
| micro-avg. F1 | 0.60 | 0.60 | 0.70*' | 0.81*' | 0.83* | 0.84*' | 0.92*' |
Evaluation performance per feature set (1 Rule-based features; 2 Statistical features; 3 Scaling features; 4 Language-dependent features; 5 Length features; 6 Word type features). * significant difference to baseline (BL) (p < 0.05); ' significant difference to predecessor (p < 0.05).
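The tables report micro-averaged F1. The page does not say how the averaging was implemented; the sketch below uses the standard definition, pooling true positives, false positives and false negatives over all classes before computing precision and recall. The function name `micro_f1` and the input layout are assumptions for illustration.

```python
def micro_f1(counts):
    """Micro-averaged F1 from a list of (tp, fp, fn) tuples, one tuple per class."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    if tp == 0:
        return 0.0  # no correct predictions: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Two classes with pooled counts tp=9, fp=3, fn=3
print(micro_f1([(8, 2, 2), (1, 1, 1)]))
```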
Abbreviation detection.
| Top 10 | 1 | w2 | 2 | w2 | 3 | w2 |
|---|---|---|---|---|---|---|
| 1 | Contains period | 0.30 | [not recoverable] | 1.34 | [not recoverable] | 3897.48 |
| 2 | All upper case | 0.02 | [not recoverable] | 0.80 | [not recoverable] | 3222.35 |
| 3 | Contains digit | 0.01 | [not recoverable] | 0.43 | [not recoverable] | 2592.76 |
| 4 | - | - | [not recoverable] | 0.31 | [not recoverable] | 2329.77 |
| 5 | - | - | [not recoverable] | 0.19 | [not recoverable] | 847.88 |
| 6 | - | - | - | - | [not recoverable] | 706.98 |
| 7 | - | - | - | - | [not recoverable] | 511.38 |
| 8 | - | - | - | - | [not recoverable] | 412.86 |
| 9 | - | - | - | - | [not recoverable] | 204.80 |
| 10 | - | - | - | - | [not recoverable] | 139.36 |

| Top 10 | 4 | w2 | 5 | w2 | 6 | w2 |
|---|---|---|---|---|---|---|
| 1 | ∈ MDDict | 0.34 | LT border | 16.15 | St.p. | 409.58 |
| 2 | - | - | LT border | 16.15 | Amb. | 409.51 |
| 3 | - | - | LT border | 16.15 | o.B. | 409.09 |
| 4 | - | - | LT | 8.74 | re. | 407.87 |
| 5 | - | - | Mean-LT | 8.74 | Z.n. | 407.35 |
| 6 | - | - | [not recoverable] | 0.54 | li. | 407.28 |
| 7 | - | - | [not recoverable] | 0.16 | ca. | 407.00 |
| 8 | - | - | [not recoverable] | 0.10 | unauff. | 406.94 |
| 9 | - | - | - | - | bds. | 406.19 |
| 10 | - | - | - | - | Pat. | 405.75 |
Top 10 feature rankings per feature set (1 Rule-based features; 2 Statistical features; 3 Scaling features; 4 Language-dependent features; 5 Length features; 6 Word type features). LT: length; w2: weight-based feature relevance criterion.
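The rule-based and length features named in the ranking ("Contains period", "All upper case", "Contains digit", LT) are simple surface properties of a token. A minimal sketch of how they might be derived is given below; the function name `surface_features` and the convention that the token is passed without its trailing period are assumptions, not details taken from the paper.

```python
def surface_features(token):
    """Surface features of a candidate token (trailing period assumed already removed)."""
    return {
        "contains_period": "." in token,                      # inner period, as in "Z.n"
        "all_upper_case": token.isupper(),                    # e.g. "CT"
        "contains_digit": any(ch.isdigit() for ch in token),  # e.g. "L4"
        "length": len(token),                                 # LT
    }


print(surface_features("Z.n"))
print(surface_features("CT"))
```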
Abbreviation detection.
| Method | BL | [1] | [1-2] | [1-3] | [1-4] | [1-5] | [1-6] |
|---|---|---|---|---|---|---|---|
| micro-avg. F1 | 0.62 | 0.60*' | 0.71*' | 0.86*' | 0.88*' | 0.95*' | 0.97*' |
| micro-avg. F1 | 0.60 | 0.60 | 0.71*' | 0.83*' | 0.86* | 0.93* | 0.95*' |
Evaluation performance combining feature sets stepwise according to their stand-alone performance (1 Rule-based features; 2 Statistical features; 3 Scaling features; 4 Language-dependent features; 5 Length features; 6 Word type features). * significant difference to baseline (BL) (p < 0.05); ' significant difference to predecessor (p < 0.05).
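The stepwise scheme in the column headers ([1], [1-2], ..., [1-6]) can be sketched as follows: order the feature sets by their stand-alone F1, then build cumulative groups. The ascending order is inferred from the table numbering, and the helper name `stepwise_groups` is hypothetical. The scores used are the first-row stand-alone values from the per-feature-set abbreviation table.

```python
def stepwise_groups(standalone_f1):
    """Return cumulative feature-set groups, ordered by ascending stand-alone F1.

    Ties keep their original order because Python's sort is stable.
    """
    ordered = sorted(standalone_f1, key=standalone_f1.get)
    return [ordered[: i + 1] for i in range(len(ordered))]


# Stand-alone micro-avg. F1 per feature set (first row of the per-set table)
scores = {1: 0.60, 2: 0.73, 3: 0.83, 4: 0.83, 5: 0.83, 6: 0.93}
for group in stepwise_groups(scores):
    print(group)
```

Each group would then be evaluated as one combined feature vector and compared with its predecessor.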
Abbreviation detection
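| Top 10 | [1] | w2 | [1-2] | w2 | [1-3] | w2 |
|---|---|---|---|---|---|---|
| 1 | Contains period | 0.30 | Contains period | 0.35 | [not recoverable] | 5885.83 |
| 2 | All upper case | 0.02 | [not recoverable] | 0.18 | [not recoverable] | 4855.66 |
| 3 | Contains digit | 0.01 | [not recoverable] | 0.13 | [not recoverable] | 1999.51 |
| 4 | - | - | [not recoverable] | 0.12 | [not recoverable] | 1798.60 |
| 5 | - | - | [not recoverable] | 0.09 | [not recoverable] | 1180.39 |
| 6 | - | - | [not recoverable] | 0.09 | [not recoverable] | 894.98 |
| 7 | - | - | All upper case | 0.02 | [not recoverable] | 715.70 |
| 8 | - | - | Contains digit | 8.16E-5 | [not recoverable] | 617.98 |
| 9 | - | - | - | - | [not recoverable] | 474.86 |
| 10 | - | - | - | - | [not recoverable] | 256.81 |

| Top 10 | [1-4] | w2 | [1-5] | w2 | [1-6] | w2 |
|---|---|---|---|---|---|---|
| 1 | [not recoverable] | 1063.78 | [not recoverable] | 1027.15 | LT | 952.62 |
| 2 | [not recoverable] | 962.33 | [not recoverable] | 914.02 | Mean-LT | 952.62 |
| 3 | [not recoverable] | 507.82 | [not recoverable] | 610.69 | All upper case | 549.64 |
| 4 | [not recoverable] | 391.68 | [not recoverable] | 527.28 | [not recoverable] | 529.85 |
| 5 | [not recoverable] | 379.70 | [not recoverable] | 463.94 | [not recoverable] | 521.60 |
| 6 | [not recoverable] | 325.68 | [not recoverable] | 274.81 | erforderl. | 403.54 |
| 7 | [not recoverable] | 265.62 | [not recoverable] | 253.30 | pathol. | 392.23 |
| 8 | [not recoverable] | 222.55 | Mean-LT | 145.91 | verschiebl. | 375.40 |
| 9 | [not recoverable] | 143.67 | LT | 145.91 | d-lat. | 358.11 |
| 10 | [not recoverable] | 129.90 | [not recoverable] | 90.13 | entzündl. | 345.21 |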
Top 10 feature rankings per feature set (1 Rule-based features; 2 Statistical features; 3 Scaling features; 4 Language-dependent features; 5 Length features; 6 Word type features). LT: length; w2: weight-based feature relevance criterion.
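The rankings are ordered by the w2 criterion. Assuming w2 stands for the squared weight of a feature in a linear model (the superscript may have been lost in extraction; the page does not confirm this), a top-10 ranking could be produced as sketched below. The function name `rank_by_squared_weight` and the example weights are hypothetical.

```python
def rank_by_squared_weight(weights, top_k=10):
    """Rank features by squared weight, largest first, and keep the top_k entries."""
    scored = [(name, w * w) for name, w in weights.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Made-up weights, for illustration only
example = {"contains_period": -3.0, "all_upper_case": 2.0, "contains_digit": 0.5}
print(rank_by_squared_weight(example, top_k=2))
```

Squaring makes the criterion independent of the sign of the weight, so strongly negative features rank as high as strongly positive ones.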
Sentence detection.
| Method | BL | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| micro-avg. F1 | 0.78 | 0.58*' | 0.76*' | 0.79*' | 0.79* | 0.82*' | 0.90*' | 0.92*' |
| micro-avg. F1 | 0.75 | 0.60*' | 0.74*' | 0.78*' | 0.81*' | 0.77*' | 0.87*' | 0.92*' |
Evaluation performance per feature set (1 Language features; 2 Rule-based features; 3 Text format features; 4 Word length features; 5 Right context word type features; 6 Word type features; 7 Abbreviation feature). * significant difference to baseline (BL) (p < 0.05); ' significant difference to predecessor (p < 0.05).
Sentence detection
| Top 10 | 1 | w2 | 2 | w2 | 3 | w2 |
|---|---|---|---|---|---|---|
| 1 | ∈ CCDict | 0.07 | Capitalization | 1.84 | No " | 0.32 |
| 2 | ∈ MDDict | 2.15E-3 | All upper case | 0.54 | Double " | 0.06 |
| 3 | - | - | Contains digit | 0.27 | Single " | 0.03 |
| 4 | - | - | Contains period | 1.59E-5 | - | - |
| 5-10 | - | - | - | - | - | - |
Top 10 feature rankings per feature set (1 Language features; 2 Rule-based features; 3 Text format features). w2: weight-based feature relevance criterion.
Sentence detection.
| Top 10 | 4 | w2 | 5 | w2 | 6 | w2 | 7 | w2 |
|---|---|---|---|---|---|---|---|---|
| 1 | LT border | 60.82 | Die | 121.98 | St.p. | 415.20 | Abbr | 0.54 |
| 2 | LT border | 60.82 | für | 121.03 | Amb. | 410.82 | - | - |
| 3 | LT border | 60.82 | TE | 94.84 | ca. | 402.62 | - | - |
| 4 | LT | 1.98 | Keine | 94.09 | Pat. | 401.16 | - | - |
| 5 | Mean-LT | 1.98 | Sono | 83.67 | max. | 397.93 | - | - |
| 6 | [not recoverable] | 0.13 | Der | 80.47 | Z.n. | 392.47 | - | - |
| 7 | [not recoverable] | 0.04 | CT | 77.13 | st.p. | 390.62 | - | - |
| 8 | [not recoverable] | 2.27E-4 | E-Nr | 75.40 | n. | 378.70 | - | - |
| 9 | - | - | Im | 71.92 | St. | 377.24 | - | - |
| 10 | - | - | Am | 66.45 | bzw. | 368.27 | - | - |
Top 10 feature rankings per feature set (4 Word length features; 5 Right context word type features; 6 Word type features; 7 Abbreviation feature). LT: length; w2: weight-based feature relevance criterion.
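For sentence detection, the feature sets describe the context of a period: the token to its left, the token to its right, and whether the left token is an abbreviation. The sketch below shows how such a feature vector might be assembled. Which token each feature is computed on is an assumption, and the names `period_features` and `known_abbreviations` are hypothetical.

```python
def period_features(left_token, right_token, known_abbreviations):
    """Features for one period, given the tokens to its left and right.

    known_abbreviations is a set of abbreviation forms without the final period.
    """
    return {
        "right_capitalized": right_token[:1].isupper(),       # e.g. "Die", "Keine"
        "right_all_upper": right_token.isupper(),             # e.g. "CT", "TE"
        "right_contains_digit": any(ch.isdigit() for ch in right_token),
        "left_length": len(left_token),                       # word length feature
        "left_is_abbreviation": left_token in known_abbreviations,
    }


abbreviations = {"Z.n", "St.p", "o.B"}
print(period_features("Z.n", "TE", abbreviations))
print(period_features("Lymphknoten", "Keine", abbreviations))
```

In this sketch a capitalized right context after a non-abbreviation is the typical pattern for a sentence boundary, which matches the high rank of "Capitalization" and the abbreviation feature in the combined rankings.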
Sentence detection.
| Method | BL | [1] | [1-2] | [1-3] | [1-4] | [1-5] | [1-6] | [1-7] |
|---|---|---|---|---|---|---|---|---|
| micro-avg. F1 | 0.78 | 0.58*' | 0.76*' | 0.88*' | 0.92*' | 0.95*' | 0.96*' | 0.97*' |
| micro-avg. F1 | 0.75 | 0.60*' | 0.75' | 0.86*' | 0.91*' | 0.93*' | 0.94*' | 0.94* |
Evaluation performance combining feature sets stepwise according to their stand-alone performance (1 Language features; 2 Rule-based features; 3 Text format features; 4 Word length features; 5 Right context word type features; 6 Word type features; 7 Abbreviation feature). * significant difference to baseline (BL) (p < 0.05); ' significant difference to predecessor (p < 0.05).
Sentence detection
| Top 10 | [1] | w2 | [1-2] | w2 | [1-3] | w2 |
|---|---|---|---|---|---|---|
| 1 | ∈ CCDict | 0.07 | Capitalization | 2.67 | Capitalization | 1.54 |
| 2 | ∈ MDDict | 2.15E-3 | All upper case | 0.47 | No " | 1.09 |
| 3 | - | - | ∈ CCDict | 0.43 | ∈ CCDict | 0.58 |
| 4 | - | - | Contains digit | 0.21 | Double " | 0.48 |
| 5 | - | - | Contains period | 0.02 | All upper case | 0.17 |
| 6 | - | - | ∈ MDDict | 8.32E-4 | Single " | 0.11 |
| 7 | - | - | - | - | Contains digit | 0.07 |
| 8 | - | - | - | - | ∈ MDDict | 0.03 |
| 9 | - | - | - | - | Contains period | 0.01 |
| 10 | - | - | - | - | - | - |
Top 10 feature rankings per feature set (1 Language features; 2 Rule-based features; 3 Text format features). w2: weight-based feature relevance criterion.
Sentence detection.
| Top 10 | [1-4] | w2 | [1-5] | w2 |
|---|---|---|---|---|
| 1 | Capitalization | 11.34 | LT | 674.21 |
| 2 | LT | 10.52 | Mean-LT | 674.21 |
| 3 | Mean-LT | 10.52 | Capitalization | 637.85 |
| 4 | No " | 4.82 | Rippenanteile | 627.54 |
| 5 | ∈ CCDict | 4.08 | Lymphknoten | 356.25 |
| 6 | Double " | 3.77 | Double " | 336.64 |
| 7 | All upper case | 0.97 | Lungengerüstzeichnung | 332.50 |
| 8 | Contains digit | 0.71 | Integument | 321.86 |
| 9 | [not recoverable] | 0.31 | No " | 300.18 |
| 10 | Contains period | 0.19 | Normale | 277.68 |

| Top 10 | [1-6] | w2 | [1-7] | w2 |
|---|---|---|---|---|
| 1 | Capitalization | 971.25 | Abbreviation | 1326.41 |
| 2 | Mean-LT | 840.45 | Capitalization | 867.06 |
| 3 | LT | 840.45 | o.B. | 382.83 |
| 4 | Double " | 341.46 | Double " | 374.57 |
| 5 | No " | 324.25 | No " | 364.32 |
| 6 | o.B. | 259.13 | bds. | 282.13 |
| 7 | Rippenanteile | 254.91 | CT | 266.54 |
| 8 | mitresez. | 254.91 | Leberlappen | 225.08 |
| 9 | CT | 251.41 | A. | 206.77 |
| 10 | Leberlappen | 236.26 | Narbige | 191.01 |
Top 10 feature rankings per feature set (1 Language features; 2 Rule-based features; 3 Text format features; 4 Word length features; 5 Right context word type features (RC); 6 Word type features; 7 Abbreviation feature). LT: length; w2: weight-based feature relevance criterion.