Phuoc Tran, Dien Dinh, Hien T Nguyen.
Abstract
Chinese and Vietnamese are both isolating languages; that is, their words are not delimited by spaces. In machine translation, word segmentation is usually performed first when translating from Chinese or Vietnamese into other languages (typically English) and vice versa. However, it is an open question whether words should be segmented when translating between two languages that both lack spaces between words, such as Chinese and Vietnamese. Because Chinese-Vietnamese is a low-resource language pair, the sparse data problem is evident in translation systems for this pair, which makes the decision of whether to segment more important. In this paper, we propose a new method for translating Chinese to Vietnamese that combines the advantages of character-level and word-level translation. At the word level, a hybrid approach that combines statistics and rules is used; at the character level, statistical translation is used. The experimental results show that our method improves machine translation performance over character-level or word-level translation alone.
Year: 2016 PMID: 27446207 PMCID: PMC4942671 DOI: 10.1155/2016/9821608
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1: Examples of incorrect translations by the character-level (CL) and word-level (WL) translation systems.
Chinese numeric characters (from 0 to 9).
| Chinese numbers | 零 | 一 | 二 | 三 | 四 | 五 | 六 | 七 | 八 | 九 |
|---|---|---|---|---|---|---|---|---|---|---|
| Vietnamese numbers | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
Chinese unit characters.
| Chinese unit characters | 十 | 百 | 千 | 万 | 亿 |
|---|---|---|---|---|---|
| Vietnamese numbers | 10 | 100 | 1,000 | 10,000 | 100,000,000 |
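The two tables above contain what a rule-based converter needs to turn a Chinese numeral into Arabic digits. The following is a minimal sketch in Python, not the paper's implementation: the function name `chinese_to_int` and the positional reading of digit-only strings (as in years) are assumptions made for illustration.

```python
# Mappings taken from the two tables above.
DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1_000}
LARGE_UNITS = {"万": 10_000, "亿": 100_000_000}


def chinese_to_int(text: str) -> int:
    """Convert a Chinese numeral such as 三千四百五十六 to an int (hypothetical helper)."""
    if not text:
        raise ValueError("empty numeral")
    # Assumption: a string made only of digit characters is read
    # positionally, e.g. 二零一六 -> 2016.
    if all(ch in DIGITS for ch in text):
        return int("".join(str(DIGITS[ch]) for ch in text))

    total = 0    # value already closed off by 万 or 亿
    section = 0  # value accumulated since the last large unit
    digit = 0    # digit waiting for its unit character
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare 十 (no preceding digit) means 10.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            section += digit
            digit = 0
            if ch == "亿":
                # 亿 scales everything read so far.
                total = (total + section) * LARGE_UNITS[ch]
            else:
                total += section * LARGE_UNITS[ch]
            section = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit
```

For example, `chinese_to_int("三千四百五十六")` evaluates to `3456`. Formal and colloquial variants (such as 两 or 〇) are outside the two tables and are rejected by this sketch.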
Figure 2: Our translation model.
Figure 3: Samples of dictionaries in rule-based translation.
Figure 4: Illustration of the translation of a Chinese sentence by our system.
Distribution of number of words and number of sentences in experimental corpora of 11,000 sentence pairs.
| Language | Corpus | NS | NW (CL) | NW/NS (CL) | NW (WL) | NW/NS (WL) |
|---|---|---|---|---|---|---|
| Chinese | Training | 9,900 | 99,026 | 10.0 | 72,541 | 7.3 |
| Chinese | Developing | 550 | 5,645 | 10.3 | 4,138 | 7.5 |
| Chinese | Testing | 550 | 5,598 | 10.2 | 4,092 | 7.4 |
| Vietnamese | Training | 9,900 | 107,153 | 10.8 | 93,909 | 9.5 |
| Vietnamese | Developing | 550 | 6,151 | 11.2 | 5,401 | 9.8 |
| Vietnamese | Testing | 550 | 5,985 | 10.9 | 5,272 | 9.6 |
Distribution of number of words and number of sentences in experimental corpora of 22,000 sentence pairs.
| Language | Corpus | NS | NW (CL) | NW/NS (CL) | NW (WL) | NW/NS (WL) |
|---|---|---|---|---|---|---|
| Chinese | Training | 19,800 | 196,903 | 9.9 | 144,475 | 7.3 |
| Chinese | Developing | 1,100 | 11,292 | 10.3 | 8,237 | 7.5 |
| Chinese | Testing | 1,100 | 11,056 | 10.1 | 8,090 | 7.4 |
| Vietnamese | Training | 19,800 | 211,179 | 10.7 | 185,346 | 9.4 |
| Vietnamese | Developing | 1,100 | 12,028 | 10.9 | 10,534 | 9.6 |
| Vietnamese | Testing | 1,100 | 11,803 | 10.7 | 10,376 | 9.4 |
Distribution of number of words and number of sentences in experimental corpora of 33,372 sentence pairs.
| Language | Corpus | NS | NW (CL) | NW/NS (CL) | NW (WL) | NW/NS (WL) |
|---|---|---|---|---|---|---|
| Chinese | Training | 30,036 | 301,630 | 10.0 | 221,419 | 7.4 |
| Chinese | Developing | 1,668 | 16,973 | 10.2 | 12,468 | 7.5 |
| Chinese | Testing | 1,668 | 17,049 | 10.2 | 12,453 | 7.5 |
| Vietnamese | Training | 30,036 | 316,453 | 10.5 | 278,232 | 9.3 |
| Vietnamese | Developing | 1,668 | 17,839 | 10.7 | 15,679 | 9.4 |
| Vietnamese | Testing | 1,668 | 17,745 | 10.6 | 15,617 | 9.4 |
NS is “number of sentences”, NW is “number of words”, and NW/NS is “NW per NS”.
We used the BLEU and TER scores to evaluate the performance of the translation systems.
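BLEU rewards n-gram overlap with a reference translation (higher is better), while TER counts the edits needed to turn the system output into the reference (lower is better). The sketch below illustrates both ideas under simplifying assumptions that are not taken from the paper: one reference per sentence, pre-tokenized input, no smoothing, and an edit rate that omits TER's phrase-shift operation, so it only approximates TER. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), single reference, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for output shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


def edit_rate(hypotheses, references):
    """Word-level edit distance per reference word (0-100+), without shifts."""
    edits = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        prev = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            cur = [i]
            for j, r in enumerate(ref, 1):
                cur.append(min(prev[j] + 1,              # delete h
                               cur[j - 1] + 1,           # insert r
                               prev[j - 1] + (h != r)))  # substitute or match
            prev = cur
        edits += prev[-1]
        ref_len += len(ref)
    return 100 * edits / ref_len
```

Both functions take lists of token lists; whether the tokens are characters or segmented words is left to the caller.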
Distribution of sentences across the experimental corpora.
| Case | Training corpus | Developing corpus | Testing corpus |
|---|---|---|---|
| 1 | From sentence 1 to sentence 18 | Sentence 19 | Sentence 20 |
| 2 | From sentence 3 to sentence 20 | Sentence 1 | Sentence 2 |
| 3 | From sentence 2 to sentence 19 | Sentence 20 | Sentence 1 |
| 4 | From sentence 1 to sentence 10 and from sentence 13 to sentence 20 | Sentence 11 | Sentence 12 |
| 5 | From sentence 1 to sentence 8 and from sentence 11 to sentence 20 | Sentence 9 | Sentence 10 |
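The table above can be read as a rule applied to every consecutive group of 20 sentence pairs: one position goes to the developing corpus, one to the testing corpus, and the remaining 18 to training. That reading is an assumption, though it is consistent with the 9,900/550/550 split reported for the 11,000-pair corpus. A minimal sketch, with the hypothetical names `CASES` and `split_corpus`:

```python
# (developing position, testing position) within each group of 20,
# 1-based, as listed per case in the table above.
CASES = {1: (19, 20), 2: (1, 2), 3: (20, 1), 4: (11, 12), 5: (9, 10)}
GROUP_SIZE = 20  # assumption: the pattern repeats every 20 sentence pairs


def split_corpus(pairs, case):
    """Split sentence pairs into (training, developing, testing) lists."""
    dev_pos, test_pos = CASES[case]
    train, dev, test = [], [], []
    for index, pair in enumerate(pairs):
        position = index % GROUP_SIZE + 1  # 1..20 within the current group
        if position == dev_pos:
            dev.append(pair)
        elif position == test_pos:
            test.append(pair)
        else:
            train.append(pair)
    return train, dev, test
```

Rotating the held-out positions across the five cases gives five different train/developing/test partitions of the same corpus.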
BLEU scores and TER scores of translation systems.
| Corpus (sentence pairs) | CL BLEU | CL TER | WL BLEU | WL TER | Google Translate BLEU | Google Translate TER | Our system BLEU | Our system TER |
|---|---|---|---|---|---|---|---|---|
| 11,000 | 25.54 | 57.65 | 25.87 | 58.22 | 16.90 | 71.06 | 26.17 | 57.13 |
| 22,000 | 28.34 | 53.55 | 28.12 | 53.89 | 16.33 | 71.03 | 28.57 | 53.55 |
| 33,372 | 31.82 | 49.52 | 31.18 | 49.80 | 14.98 | 73.66 | 32.05 | 49.31 |