Ling Zhu, Derek F Wong, Lidia S Chao.
Abstract
This paper presents a novel approach to unsupervised shallow parsing, in which the model is trained on the unannotated Chinese side of a parallel Chinese-English corpus. No annotation on the Chinese side is used. Graph-based label propagation for bilingual knowledge transfer, together with the use of the projected labels as features in the unsupervised model, contributes to better performance. Experimental comparisons with state-of-the-art algorithms show that the proposed approach achieves considerably higher accuracy in terms of F-score.
Year: 2014 PMID: 24772017 PMCID: PMC3977424 DOI: 10.1155/2014/401943
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1: Direct label projection from English to Chinese with position information.
Figure 2: Adjusting the chunk tag based on position information.
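The projection and adjustment steps named in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. It assumes BIO-style position prefixes (`B-`, `I-`) and a one-to-one word alignment given as a dictionary; neither is specified in the source, and the function name `project_chunk_tags` is hypothetical. Merging adjacent tokens of the same chunk type into one chunk is a simplification, not necessarily the authors' rule.

```python
from typing import Dict, List, Optional


def project_chunk_tags(
    english_tags: List[str],
    alignment: Dict[int, int],
    chinese_length: int,
) -> List[Optional[str]]:
    """Copy chunk tags from aligned English tokens to Chinese tokens.

    english_tags: BIO-style tag per English token, e.g. "B-NP", "I-NP", "O".
    alignment: maps a Chinese token index to an English token index
               (assumed one-to-one for this sketch).
    Returns one tag per Chinese token; None marks unaligned tokens.
    """
    # Step 1: direct projection through the alignment links.
    projected: List[Optional[str]] = [None] * chinese_length
    for zh_index, en_index in alignment.items():
        projected[zh_index] = english_tags[en_index]

    # Step 2: adjust the position prefix so that, in Chinese word order,
    # every chunk starts with "B-" and continues with "I-".
    adjusted: List[Optional[str]] = []
    previous_type: Optional[str] = None
    for tag in projected:
        if tag is None or tag == "O":
            adjusted.append(tag)
            previous_type = None
            continue
        chunk_type = tag.split("-", 1)[1]
        prefix = "I" if chunk_type == previous_type else "B"
        adjusted.append(f"{prefix}-{chunk_type}")
        previous_type = chunk_type
    return adjusted
```

When word order differs between the two languages, a projected `I-NP` can end up before its `B-NP`; the second loop repairs exactly this case.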
Algorithm 1: Graph-based unsupervised chunking approach.
Various features for computing edge weights between Chinese trigram types.
| Description | Feature |
|---|---|
| Trigram + context | |
| Trigram | |
| Left context | |
| Right context | |
| Center word | |
| Trigram − center word | |
| Left word + right context | |
| Right word + left context | |
| Suffix | Has suffix ( |
| Prefix | Has prefix ( |
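To make the role of these features concrete, the following sketch builds a feature vector for each trigram type and compares two types. The exact feature definitions were lost from the table, so the window positions used here (a five-word window `x1 … x5` centered on `x3`) are assumptions, as are the use of raw counts and cosine similarity as the edge weight; the suffix and prefix features are omitted. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Window = Tuple[str, str, str, str, str]  # five-word window around a center word
Trigram = Tuple[str, ...]


def trigram_features(window: Window) -> List[str]:
    """Emit one feature string per row of the table (assumed definitions)."""
    x1, x2, x3, x4, x5 = window
    return [
        f"trigram+context={x1} {x2} {x3} {x4} {x5}",
        f"trigram={x2} {x3} {x4}",
        f"left_context={x1} {x2}",
        f"right_context={x4} {x5}",
        f"center_word={x3}",
        f"trigram-center_word={x2} {x4}",
        f"left_word+right_context={x2} {x4} {x5}",
        f"right_word+left_context={x1} {x2} {x4}",
    ]


def feature_vectors(windows: Iterable[Window]) -> Dict[Trigram, Counter]:
    """Sum feature counts over all occurrences of each trigram type."""
    vectors: Dict[Trigram, Counter] = {}
    for window in windows:
        trigram = window[1:4]
        vectors.setdefault(trigram, Counter()).update(trigram_features(window))
    return vectors


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

Two trigram types that tend to occur in the same contexts share many features and therefore receive a high edge weight, which is what lets labels flow between them in the graph.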
Figure 3: An example of a similarity graph over trigrams on labeled and unlabeled data.
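The idea of propagating labels over such a graph can be sketched as follows. This is a simplified iterative scheme, not the objective used in the paper: labeled (seed) vertices keep their distributions, and every unlabeled vertex repeatedly takes the edge-weighted average of its neighbours' tag distributions. The function name, the graph encoding, and the number of iterations are assumptions.

```python
from typing import Dict, List

Distribution = Dict[str, float]


def propagate_labels(
    edges: Dict[str, Dict[str, float]],
    seeds: Dict[str, Distribution],
    tags: List[str],
    iterations: int = 10,
) -> Dict[str, Distribution]:
    """Spread tag distributions from seed vertices to unlabeled vertices.

    edges: vertex -> {neighbour: edge weight}; every neighbour must also
           appear as a key of `edges`.
    seeds: tag distributions of the labeled vertices (kept fixed).
    """
    uniform = {tag: 1.0 / len(tags) for tag in tags}
    current = {vertex: dict(seeds.get(vertex, uniform)) for vertex in edges}

    for _ in range(iterations):
        updated: Dict[str, Distribution] = {}
        for vertex, neighbours in edges.items():
            total_weight = sum(neighbours.values())
            if vertex in seeds or total_weight == 0:
                # Seeds are clamped; isolated vertices keep their value.
                updated[vertex] = dict(current[vertex])
                continue
            updated[vertex] = {
                tag: sum(
                    weight * current[neighbour].get(tag, 0.0)
                    for neighbour, weight in neighbours.items()
                )
                / total_weight
                for tag in tags
            }
        current = updated
    return current
```

An unlabeled trigram that is strongly connected to an NP seed and weakly connected to a VP seed ends up with most of its probability mass on NP.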
The description of universal chunk tags.
| Tag | Description | Words | Example |
|---|---|---|---|
| NP | Noun phrase | DET + ADV + ADJ + NOUN | The strange birds |
| PP | Prepositional phrase | TO + IN | In between |
| VP | Verb phrase | ADV + VB | Was looking |
| ADVP | Adverb phrase | ADV | Also |
| ADJP | Adjective phrase | CONJ + ADV + ADJ | Warm and cozy |
| SBAR | Subordinating conjunction | IN | Whether or not |
Feature template used in unsupervised chunking.
| Feature | Definition |
|---|---|
| Basic | ( |
| Contains digit | check if |
| Contains hyphen | contains hyphen( |
| Suffix | indicator features for character suffixes of up to length 1 present in |
| Prefix | indicator features for character prefixes of up to length 1 present in |
| POS tag | indicator feature for word POS assigned to |
Box 1: Example of feature template.
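As an illustration of how such a template might be instantiated for a single word, the sketch below emits one indicator feature per row of the template. The arguments of the original definitions were lost in extraction, so it is an assumption that each feature is computed on the word itself; the pairing of each feature with a hidden chunk label, as a feature-based HMM would require, is left out. The function name and the feature string format are hypothetical.

```python
from typing import List


def word_features(word: str, pos_tag: str, max_affix_length: int = 1) -> List[str]:
    """Return the indicator features that fire for one word."""
    # Basic feature: the word identity.
    features = [f"basic={word}"]

    # Orthographic checks.
    if any(ch.isdigit() for ch in word):
        features.append("contains_digit")
    if "-" in word:
        features.append("contains_hyphen")

    # Character suffixes and prefixes up to the given length
    # (length 1 in the template).
    for n in range(1, min(max_affix_length, len(word)) + 1):
        features.append(f"suffix={word[-n:]}")
        features.append(f"prefix={word[:n]}")

    # POS tag assigned to the word.
    features.append(f"pos={pos_tag}")
    return features
```

For Chinese text, a one-character suffix or prefix is often informative on its own, since single characters frequently carry morphological or semantic information.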
Chunk tagging evaluation results for various baselines and proposed graph-based model.
| Tag | Feature-HMM Pre | Feature-HMM Rec | Feature-HMM F1 | Projection Pre | Projection Rec | Projection F1 | Graph-based Pre | Graph-based Rec | Graph-based F1 |
|---|---|---|---|---|---|---|---|---|---|
| NP | 0.60 | 0.67 | 0.63 | 0.67 | 0.72 | 0.68 | 0.81 | 0.82 | 0.81 |
| VP | 0.56 | 0.51 | 0.53 | 0.61 | 0.57 | 0.59 | 0.78 | 0.74 | 0.76 |
| PP | 0.36 | 0.28 | 0.32 | 0.44 | 0.31 | 0.36 | 0.60 | 0.51 | 0.55 |
| ADVP | 0.40 | 0.46 | 0.43 | 0.45 | 0.52 | 0.48 | 0.64 | 0.68 | 0.66 |
| ADJP | 0.47 | 0.53 | 0.50 | 0.48 | 0.58 | 0.51 | 0.67 | 0.71 | 0.69 |
| SBAR | 0.00 | 0.00 | 0.00 | 0.50 | 1.0 | 0.66 | 0.50 | 1.0 | 0.66 |
| All | 0.49 | 0.51 | 0.50 | 0.57 | 0.62 | 0.59 | | | |
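For readers who want to relate the three columns of each system, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The helper below is a minimal sketch with a hypothetical name; because the table reports values rounded to two decimals, recomputing F1 from the rounded precision and recall may differ from the reported figure in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Feature-HMM rows for NP and VP, taken from the table.
print(round(f1_score(0.60, 0.67), 2))
print(round(f1_score(0.56, 0.51), 2))
```

The zero guard covers rows such as SBAR under Feature-HMM, where both precision and recall are 0.00.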
(a)
| Number of sentence pairs | Number of seeds | Number of words |
|---|---|---|
| 10,000 | 27,940 | 31,678 |
(b)
| Number of sentences | Number of vertices |
|---|---|
| 17,617 | 185,441 |
(c)
| Dataset | Source | Number of sentences |
|---|---|---|
| Training dataset | Xinhua 1–321 | 7,617 |
| Testing dataset | Xinhua 363–403 | 912 |