Derek F Wong, Lidia S Chao, Xiaodong Zeng.
Abstract
Sentence boundary detection (SBD) systems are normally quite sensitive to the genre of the data they are trained on. Genre here covers shifts in text topic as well as new language domains. Although new detection models can be trained for different languages or new text genres, the previous model has to be thrown away and the creation process restarted from scratch. In this paper, we present a multilingual sentence boundary detection system (iSentenizer-μ) for Danish, German, English, Spanish, Dutch, French, Italian, Portuguese, Greek, Finnish, and Swedish. The proposed system detects sentence boundaries in a mixture of text genres and languages with high accuracy. We construct the system with the i+Learning algorithm, an incremental tree learning architecture. Under this incremental learning framework, iSentenizer-μ adapts to text on different topics and in Roman-alphabet languages by merging new data into the existing model, learning the new knowledge incrementally by revision instead of retraining. The system has been extensively evaluated on different languages and text genres and compared against two state-of-the-art SBD systems, Punkt and MaxEnt. The experimental results show that the proposed system outperforms the other systems on all datasets.
Year: 2014 PMID: 24883358 PMCID: PMC4030568 DOI: 10.1155/2014/196574
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Four classes of the importance grades.
| Grade | Action in tree revision |
|---|---|
| High | Attribute |
| Medium | Attribute |
| Low | Attribute |
| None | Attribute |
Box 1. Sentence labeled with boundary tags.
Box 2. Examples of instances represented as feature vectors.
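The contents of the two boxes are not reproduced in this record, so the following is only a rough sketch of how candidate punctuation marks in a text could be turned into feature vectors for a boundary classifier. The candidate set (`.`, `?`, `!`), the function name `featurize`, and every feature name are illustrative assumptions, not the features used by iSentenizer-μ.

```python
import re

# Assumption: candidate boundaries are the punctuation marks ".", "?" and "!".
CANDIDATE = re.compile(r"[.?!]")


def featurize(text):
    """Return a list of (position, features) pairs, one per candidate mark.

    The features are hypothetical and describe only the tokens
    immediately to the left and right of the candidate.
    """
    instances = []
    for match in CANDIDATE.finditer(text):
        i = match.start()
        left_tokens = text[:i].split()
        right_tokens = text[i + 1:].split()
        prev_token = left_tokens[-1] if left_tokens else ""
        next_token = right_tokens[0] if right_tokens else ""
        features = {
            "punct": match.group(),
            "prev_len": len(prev_token),
            "prev_is_upper_initial": prev_token[:1].isupper(),
            "next_is_upper_initial": next_token[:1].isupper(),
            # True when whitespace or the end of the text follows the mark.
            "followed_by_space": text[i + 1:i + 2].isspace() or i + 1 == len(text),
        }
        instances.append((i, features))
    return instances


if __name__ == "__main__":
    for position, features in featurize("Mr. Smith arrived. He sat down."):
        print(position, features)
```

In this toy sentence the period after "Mr" and the period after "arrived" both precede a capitalized token, which is why surface features alone are not enough and a trained classifier is needed to separate abbreviations from true boundaries.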
A confusion matrix for SBD output. “True” denotes positive cases, that is, the sentence boundaries.
| | System true | System false |
|---|---|---|
| Test true | | |
| Test false | | |
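Recall, precision, and F1 can be derived from the four cells of such a confusion matrix. The sketch below uses the standard definitions and assumes that "Test" is the gold-standard label and "System" is the predicted label; the class name `Confusion` and the counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # test true,  system true  (boundary correctly detected)
    fn: int  # test true,  system false (boundary missed)
    fp: int  # test false, system true  (spurious boundary)
    tn: int  # test false, system false (non-boundary correctly rejected)


def recall(c: Confusion) -> float:
    total = c.tp + c.fn
    return c.tp / total if total else 0.0


def precision(c: Confusion) -> float:
    total = c.tp + c.fp
    return c.tp / total if total else 0.0


def f1(c: Confusion) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    c = Confusion(tp=90, fn=10, fp=5, tn=895)
    print(f"recall={recall(c):.4f} precision={precision(c):.4f} f1={f1(c):.4f}")
```

The true-negative cell does not enter any of the three scores, so a large number of easy non-boundary candidates does not inflate them.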
Size of the Brown, WSJ, and Tycho Brahe corpora.
| Corpus | Sentences (training data) | Sentences (test data) | Tokens |
|---|---|---|---|
| WSJ corpus | 41,977 | 4,671 | 1,153,993 |
| Brown corpus | 51,599 | 5,801 | 1,155,242 |
| Tycho Brahe corpus | 38,000 | 5,102 | 953,080 |
Information of Europarl corpus.
| Language | Sentences (training data) | Sentences (test data) | Tokens |
|---|---|---|---|
| Danish | 30,343 | 3,375 | 917,231 |
| German | 29,854 | 3,319 | 890,176 |
| English | 29,774 | 3,309 | 949,716 |
| Spanish | 33,869 | 3,765 | 1,082,826 |
| Dutch | 29,604 | 3,389 | 688,018 |
| French | 29,887 | 3,321 | 1,098,724 |
| Italian | 27,589 | 5,067 | 929,042 |
| Portuguese | 28,967 | 2,777 | 947,086 |
| Greek | 27,687 | 3,077 | 888,321 |
| Finnish | 29,504 | 3,309 | 687,804 |
| Swedish | 26,649 | 2,962 | 765,795 |
Number of abbreviations in corpora.
| Corpus | Number of abbreviations (train) | Number of abbreviations (test) |
|---|---|---|
| WSJ corpus | 27,960 | 3,110 |
| Brown corpus | 644 | 158 |
| Tycho Brahe corpus | 382 | 8 |
Classification results of baseline model.
| Corpus | Recall | Precision | F1 |
|---|---|---|---|
| WSJ corpus | 0.9757 | 0.9815 | 0.9786 |
| Brown corpus | 0.9955 | 0.9995 | 0.9975 |
| Tycho Brahe corpus | 0.9973 | 0.9983 | 0.9978 |
Classification results of iSentenizer-μ.
| Corpus | Recall | Precision | F1 |
|---|---|---|---|
| WSJ corpus | 0.9843 | 0.9918 | |
| Brown corpus | 0.9967 | 0.9995 | |
| Tycho Brahe corpus | 0.9973 | 0.9983 | 0.9978 |
Abbreviations classification error rate.
| Corpus | Baseline errors | iSentenizer-μ errors | Reduced error rate |
|---|---|---|---|
| WSJ corpus | 548 | 34 | 94% (↓) |
| Brown corpus | 0 | 0 | 0% |
| Tycho Brahe corpus | 6 | 2 | 67% (↓) |
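The reduced error rate in this table appears to be the relative drop in abbreviation errors from the baseline count to the second count. That reading is an assumption, but the rows bear it out, as the short check below shows; the function name `reduced_error_rate` is hypothetical.

```python
def reduced_error_rate(baseline_errors: int, new_errors: int) -> float:
    """Relative error reduction, defined as 0.0 when the baseline has no errors."""
    if baseline_errors == 0:
        return 0.0
    return (baseline_errors - new_errors) / baseline_errors


# (baseline errors, errors after improvement), copied from the table above.
rows = {
    "WSJ corpus": (548, 34),
    "Brown corpus": (0, 0),
    "Tycho Brahe corpus": (6, 2),
}

if __name__ == "__main__":
    for corpus, (before, after) in rows.items():
        print(f"{corpus}: {reduced_error_rate(before, after):.0%}")
    # WSJ corpus: 94%
    # Brown corpus: 0%
    # Tycho Brahe corpus: 67%
```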
Performance of systems on different languages of Europarl corpus.
| Corpus | Candidates | Recall | Precision | F1 |
|---|---|---|---|---|
| Danish | iSentenizer-μ | | 92.88% | |
| | Punkt | 97.69% | 79.37% | 87.59% |
| | MxTerminator | 35.48% | | 51.54% |
| German | iSentenizer-μ | 97.61% | | |
| | Punkt | | 87.53% | 92.41% |
| | MxTerminator | 81.00% | 93.69% | 86.89% |
| English | iSentenizer-μ | | | |
| | Punkt | 97.95% | 93.34% | 95.59% |
| | MxTerminator | 96.09% | 93.97% | 95.02% |
| Spanish | iSentenizer-μ | | | |
| | Punkt | 98.11% | 89.80% | 93.77% |
| | MxTerminator | 96.67% | 90.09% | 93.26% |
| Dutch | iSentenizer-μ | | | |
| | Punkt | 97.79% | 92.34% | 94.99% |
| | MxTerminator | 91.95% | 95.32% | 93.61% |
| French | iSentenizer-μ | | | |
| | Punkt | 97.84% | 91.37% | 94.49% |
| | MxTerminator | 95.04% | 91.88% | 93.44% |
| Italian | iSentenizer-μ | | | |
| | Punkt | 98.25% | 93.69% | 95.92% |
| | MxTerminator | 94.96% | 94.43% | 94.70% |
| Portuguese | iSentenizer-μ | | | |
| | Punkt | 98.50% | 95.76% | 97.11% |
| | MxTerminator | 94.88% | 96.50% | 95.68% |
| Greek | iSentenizer-μ | | | |
| | Punkt | 96.98% | 95.36% | 96.16% |
| | MxTerminator | 97.24% | 93.97% | 95.58% |
| Finnish | iSentenizer-μ | | | |
| | Punkt | 98.33% | 92.34% | 95.24% |
| | MxTerminator | 92.46% | 95.32% | 93.87% |
| Swedish | iSentenizer-μ | 95.91% | | |
| | Punkt | 98.94% | 89.45% | 93.95% |
| | MxTerminator | | 88.33% | 93.57% |
Results on the Brown, WSJ, and Tycho Brahe corpus.
| Corpus | Candidates | Recall | Precision | F1 |
|---|---|---|---|---|
| WSJ corpus | iSentenizer-μ | | | |
| | Punkt | 93.08% | 57.84% | 71.34% |
| | MxTerminator | 93.08% | 60.24% | 73.14% |
| Brown corpus | iSentenizer-μ | | 99.98% | |
| | Punkt | 96.30% | 99.95% | 98.09% |
| | MxTerminator | 96.30% | 99.98% | 98.11% |
| Tycho Brahe corpus | iSentenizer-μ | | 99.86% | |
| | Punkt | 79.83% | 99.90% | 88.74% |
| | MxTerminator | 79.83% | | 88.77% |
Trained on the Brown corpus and tested on the Tycho Brahe corpus.
| Candidates | Recall | Precision | F1 |
|---|---|---|---|
| iSentenizer-μ | | 99.02% | |
| Punkt | 71.80% | 99.90% | 83.55% |
| MxTerminator | 71.80% | | 83.59% |
Trained on the Tycho Brahe corpus and tested on the Brown corpus.
| Candidates | Recall | Precision | F1 |
|---|---|---|---|
| iSentenizer-μ | | | |
| Punkt | 95.72% | 80.07% | 87.20% |
| MxTerminator | 93.54% | 93.94% | 93.74% |
Results of cross Europarl corpora evaluation.
| Candidates | Recall | Precision | F1 |
|---|---|---|---|
| iSentenizer-μ | | 94.91% | |
| Punkt | 98.01% | 90.11% | 93.89% |
| MxTerminator | 92.25% | | 93.92% |
Classification results on mixture of Europarl corpora.
| Corpus | Recall | Precision | F1 |
|---|---|---|---|
| | 98.15% | 86.28% | 91.83% |
| | 94.12% | 94.59% | 94.35% |
| | 96.72% | 89.60% | 93.02% |
| | 96.29% | 94.37% | 95.32% |