Qiuping Huang1, Liangye He1, Derek F Wong1, Lidia S Chao1.
Abstract
This paper investigates the recognition of unknown words in Chinese parsing and proposes two methods to handle them. The first modifies a character-based model: the emission probability of an unknown word is modeled using the first and last characters of the word, with the aim of reducing the POS tag ambiguity of unknown words and thereby improving parsing performance. The second is a novel method based on graph-based semi-supervised learning (SSL), which aims to discover additional lexical knowledge from a large amount of unlabeled data. It builds similarity graphs over the words of the labeled and unlabeled data and propagates lexical emission probabilities to unknown words; the derived distributions are then incorporated into the parsing process. Empirical results on the Penn Chinese Treebank and the TCT Treebank show that both methods are effective in handling unknown words and improving parsing.
Year: 2014 PMID: 24895681 PMCID: PMC4032743 DOI: 10.1155/2014/959328
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
The effect of the character-based model on TCT.
| | Length | | | F1 |
|---|---|---|---|---|
| Baseline | All | 80.97 | 80.99 | 80.98 |
| Baseline | ≤40 | 83.56 | 83.55 | 83.55 |
| Character-based | All | 82.76 | 82.47 | 82.83 |
| Character-based | ≤40 | 84.96 | 85.08 | 85.02 |
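The character-based idea from the abstract, estimating an unknown word's emission probability from its first and last characters, can be illustrated with a minimal sketch. The combination rule (geometric mean of two add-alpha smoothed character probabilities), the function names, and the toy lexicon with CTB-style tags are all assumptions for illustration; the source does not give the exact formula.

```python
import math
from collections import Counter, defaultdict


def train_char_counts(tagged_words):
    """Count, per tag, the first and last characters of the training words."""
    first = defaultdict(Counter)  # tag -> counts of first characters
    last = defaultdict(Counter)   # tag -> counts of last characters
    chars = set()
    for word, tag in tagged_words:
        first[tag][word[0]] += 1
        last[tag][word[-1]] += 1
        chars.update((word[0], word[-1]))
    return first, last, chars


def unknown_emission(word, tag, first, last, chars, alpha=1.0):
    """Assumed estimate: sqrt(P(first char | tag) * P(last char | tag))."""
    vocab = len(chars) + 1  # +1 reserves probability mass for unseen characters

    def prob(counter, ch):
        return (counter[ch] + alpha) / (sum(counter.values()) + alpha * vocab)

    return math.sqrt(prob(first[tag], word[0]) * prob(last[tag], word[-1]))


# Hypothetical toy lexicon (not taken from CTB or TCT).
toy = [("专业", "NN"), ("专家", "NN"), ("非常", "AD"), ("经常", "AD")]
first, last, chars = train_char_counts(toy)

# "专长" is unseen; its first character is shared with the two NN entries.
p_nn = unknown_emission("专长", "NN", first, last, chars)
p_ad = unknown_emission("专长", "AD", first, last, chars)
print(p_nn > p_ad)
```

In this toy setting the unseen word scores higher under NN than under AD because its first character occurs only in NN words, which is the kind of ambiguity reduction the abstract describes.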
Algorithm 1: Word label propagation algorithm.
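The body of Algorithm 1 is not reproduced here. As a rough illustration of propagating tag distributions over a similarity graph, the sketch below uses a generic iterative scheme in which labeled vertices stay clamped and each unlabeled vertex takes the weighted average of its neighbors' distributions. This is a common textbook variant and not necessarily the paper's update rule; the vertex names, edge weights, and tags are hypothetical.

```python
def propagate(graph, seeds, tags, iterations=10):
    """Generic label propagation with clamped seeds.

    graph: {vertex: {neighbor: weight}}, every neighbor must also be a key.
    seeds: {vertex: {tag: prob}} for labeled vertices (all tags present).
    """
    uniform = {t: 1.0 / len(tags) for t in tags}
    dist = {v: dict(seeds.get(v, uniform)) for v in graph}
    for _ in range(iterations):
        new = {}
        for v, nbrs in graph.items():
            if v in seeds:
                new[v] = dict(seeds[v])  # labeled vertices keep their distribution
                continue
            total = sum(nbrs.values())
            if total == 0:
                new[v] = dist[v]  # isolated vertex: nothing to propagate
                continue
            new[v] = {
                t: sum(w * dist[u][t] for u, w in nbrs.items()) / total
                for t in tags
            }
        dist = new
    return dist


# Hypothetical graph: "x" is an unknown word linked to two labeled words.
tags = ["NN", "VV"]
graph = {
    "a": {"x": 1.0},
    "b": {"x": 0.25},
    "x": {"a": 1.0, "b": 0.25},
}
seeds = {"a": {"NN": 1.0, "VV": 0.0}, "b": {"NN": 0.0, "VV": 1.0}}
result = propagate(graph, seeds, tags)
print(result["x"])
```

The unlabeled vertex ends up closer to the neighbor with the stronger edge, which is how similarity to known words can supply a tag distribution for an unknown word.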
Features used to measure the similarity between two vertices, for the example text “他非常专业” (He is very professional), where the trigram is “非常专”.
| Feature | Example |
|---|---|
| Trigram + Context | |
| Trigram | |
| Left Context | |
| Right Context | |
| Center Word | |
| Left Word + Right Word | |
| Left Word + Right Context | |
| Left Context + Right Word | |
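The example values for these features did not survive in the table. The sketch below shows one way such features could be extracted, assuming a five-unit window x1..x5 centered on the trigram x2 x3 x4 (a convention common in graph-based tagging work). The window size, the mapping of each feature name to positions, and the `_` placeholder for a skipped unit are assumptions, not the paper's definitions.

```python
def trigram_features(units, i):
    """Features for the trigram centered at index i, under an assumed 5-unit window."""
    if i < 2 or i + 2 >= len(units):
        raise ValueError("sketch assumes two units of context on each side")
    x1, x2, x3, x4, x5 = units[i - 2:i + 3]
    return {
        "Trigram + Context": x1 + x2 + x3 + x4 + x5,
        "Trigram": x2 + x3 + x4,
        "Left Context": x1 + x2,
        "Right Context": x4 + x5,
        "Center Word": x3,
        "Left Word + Right Word": x2 + "_" + x4,
        "Left Word + Right Context": x2 + "_" + x4 + x5,
        "Left Context + Right Word": x1 + x2 + "_" + x4,
    }


# The caption's example text, split into characters; index 2 is the center "常".
feats = trigram_features(list("他非常专业"), 2)
for name, value in feats.items():
    print(name, value)
```

Two vertices can then be compared by the overlap or weighted similarity of their feature sets collected over all occurrences in the labeled and unlabeled data.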
Summary statistics of the CTB-5.0 data.
| | Train | Unlabeled | Dev | Test |
|---|---|---|---|---|
| #Sentence | 17,785 | 19,075 | 352 | 348 |
| #Word | 485,230 | 1,110,947 | 6,821 | 8,008 |
| #OOV | — | — | 382 | 263 |
Summary statistics of the TCT data.
| | Train | Unlabeled | Dev | Test |
|---|---|---|---|---|
| #Sentence | 14,045 | 19,075 | 1,755 | 1,758 |
| #Word | 377,303 | 1,110,947 | 47,836 | 48,449 |
| #OOV | — | — | 1,928 | 1,916 |
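To put the OOV counts in proportion, the following sketch computes the share of out-of-vocabulary words in the development and test sets of both corpora. It assumes that #OOV and #Word are both token counts, which the tables do not state; the variable and function names are illustrative.

```python
# (OOV count, word count) pairs copied from the two statistics tables.
stats = {
    "CTB-5.0": {"dev": (382, 6821), "test": (263, 8008)},
    "TCT": {"dev": (1928, 47836), "test": (1916, 48449)},
}


def oov_rate(oov, words):
    """Percentage of words that are out of vocabulary."""
    return 100.0 * oov / words


rates = {
    corpus: {split: oov_rate(*counts) for split, counts in splits.items()}
    for corpus, splits in stats.items()
}

for corpus, splits in rates.items():
    for split, rate in splits.items():
        print(f"{corpus} {split}: {rate:.2f}% OOV")
```

Under that assumption, the rates fall between roughly 3% and 6% of words in every evaluation split.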
POS and parsing accuracy on TCT with the character-based model.
| | Length | | | F1 | POS |
|---|---|---|---|---|---|
| Baseline | All | 80.97 | 80.99 | 80.98 | 94.51 |
| Baseline | ≤40 | 83.56 | 83.55 | 83.55 | 94.56 |
| TCT | All | 82.76 | 82.47 | 82.83 | 94.80 |
| TCT | ≤40 | 84.96 | 85.08 | 85.02 | 94.76 |
POS and parsing accuracy on CTB with the graph-based OOV model.
| | Length | | | F1 | POS |
|---|---|---|---|---|---|
| Baseline | All | 78.34 | 82.68 | 80.45 | 94.88 |
| Baseline | ≤40 | 81.78 | 85.63 | 83.66 | 95.58 |
| CTB | All | 78.90 | 83.20 | 80.99 | 95.77 |
| CTB | ≤40 | 82.38 | 86.34 | 84.31 | 96.31 |
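The labels of the first two score columns were lost, but the third score in each row of this table is consistent with the harmonic mean (F1) of the first two, as the short check below illustrates. Which of the first two columns is precision and which is recall cannot be determined from the numbers alone, since F1 is symmetric; the function name is illustrative.

```python
def f1(a, b):
    """Harmonic mean of two scores (symmetric in precision and recall)."""
    return 2 * a * b / (a + b)


# (first score, second score, third score) from the CTB graph-based table.
rows = [
    (78.34, 82.68, 80.45),
    (81.78, 85.63, 83.66),
    (78.90, 83.20, 80.99),
    (82.38, 86.34, 84.31),
]

consistent = [abs(f1(a, b) - reported) < 0.01 for a, b, reported in rows]
print(consistent)
```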
POS and parsing accuracy on TCT with the graph-based OOV model.
| | Length | | | F1 | POS |
|---|---|---|---|---|---|
| Baseline | All | 80.97 | 80.99 | 80.98 | 94.51 |
| Baseline | ≤40 | 83.56 | 83.55 | 83.55 | 94.56 |
| TCT | All | 81.30 | 81.32 | 81.31 | 95.51 |
| TCT | ≤40 | 83.92 | 83.91 | 83.92 | 95.60 |