Kanglong Liu, Rongguang Ye, Liu Zhongzhu, Rongye Ye.
Abstract
The present research reports on the use of data mining techniques for differentiating between translated and non-translated (original) Chinese based on monolingual comparable corpora. We operationalized seven entropy-based metrics, namely character entropy, wordform unigram, bigram and trigram entropy, and POS (part-of-speech) unigram, bigram and trigram entropy, from two balanced Chinese comparable corpora (translated vs non-translated) for data mining and analysis. We then applied four data mining techniques, Support Vector Machines (SVMs), Linear Discriminant Analysis (LDA), Random Forest (RF) and Multilayer Perceptron (MLP), to distinguish translated Chinese from original Chinese based on these seven features. Our results show that SVMs are the most robust and effective of the four classifiers, yielding an AUC of 90.5% and an accuracy rate of 84.3%. Our results affirm the hypothesis that translational language is categorically different from original language. Our research demonstrates that combining the information-theoretic indicator of Shannon's entropy with machine learning techniques can provide a novel approach for studying translation as a unique communicative activity. This study has yielded new insights for corpus-based studies on the translationese phenomenon in the field of translation studies.
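
The entropy features named in the abstract can be illustrated with a minimal sketch. It assumes plain Shannon entropy in bits over the relative frequencies of n-grams; the exact formulation the authors used (log base, any normalization, segmentation and tagging tools) is not stated in this record, and the function names and toy token list are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shannon_entropy(items):
    """Shannon entropy in bits of the empirical distribution over items.

    Assumption: H = -sum(p * log2(p)), with p the relative frequency
    of each distinct item. An empty input is given entropy 0.0.
    """
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy, already-segmented sentence (illustrative data, not from the corpora).
tokens = ["我", "喜欢", "读", "书", "我", "喜欢", "茶"]

unigram_entropy = shannon_entropy(ngrams(tokens, 1))
bigram_entropy = shannon_entropy(ngrams(tokens, 2))
trigram_entropy = shannon_entropy(ngrams(tokens, 3))
```

The same two functions would apply to a sequence of POS tags or of single characters, which is how one helper could cover all seven feature types under these assumptions.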
Year: 2022 PMID: 35324927 PMCID: PMC8947138 DOI: 10.1371/journal.pone.0265633
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Genres and text types in LCMC and ZCTC.
| Genre | Text type | Samples | Proportion |
|---|---|---|---|
| Press | Press reportage | 44 | 8.8% |
| Press | Press editorial | 27 | 5.4% |
| Press | Press reviews | 17 | 3.4% |
| General Prose | Religious writing | 17 | 3.4% |
| General Prose | Instructional writing | 38 | 7.6% |
| General Prose | Popular lore | 44 | 8.8% |
| General Prose | Biographies and essays | 77 | 15.4% |
| General Prose | Reports and official documents | 30 | 6% |
| Academic | Academic prose | 80 | 16% |
| Fiction | General fiction | 29 | 5.8% |
| Fiction | Mystery and detective fiction | 24 | 4.8% |
| Fiction | Science fiction | 6 | 1.2% |
| Fiction | Adventure fiction | 29 | 5.8% |
| Fiction | Romantic fiction | 29 | 5.8% |
| Fiction | Humor | 9 | 1.8% |
| Total | | 500 | 100% |
Performance evaluation of the four classifiers on the test set.
| Classifier | AUC (%) | Accuracy (%) |
|---|---|---|
| SVMs | 90.5 | 84.3 |
| LDA | 90.03 | 81.33 |
| MLP | 89.97 | 83.33 |
| RF | 88.15 | 82.00 |
Importance coefficients and ranking of features in the SVMs model.
| Feature | Coefficient | Importance rank |
|---|---|---|
| POS bigram | -7.9062 | 1 |
| POS trigram | 5.4895 | 2 |
| Wordform trigram | -5.3873 | 3 |
| Wordform unigram | 3.5080 | 4 |
| Wordform bigram | 3.0918 | 5 |
| POS unigram | -2.9562 | 6 |
| Character | -2.5134 | 7 |
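
The ranks in the table above are consistent with ordering the features by the absolute value of their coefficients. The sketch below reproduces that ordering from the listed values; that the authors derived the ranks this way is an assumption, and the variable names are illustrative.

```python
# Coefficients copied from the SVMs table above.
svm_coefs = {
    "POS bigram": -7.9062,
    "POS trigram": 5.4895,
    "Wordform trigram": -5.3873,
    "Wordform unigram": 3.5080,
    "Wordform bigram": 3.0918,
    "POS unigram": -2.9562,
    "Character": -2.5134,
}

# Assumption: importance is the magnitude of the coefficient, so the sign
# (direction of the effect) is ignored when ranking.
ranked = sorted(svm_coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Map each feature name to its 1-based rank.
ranks = {name: position for position, (name, _) in enumerate(ranked, start=1)}

for name, coef in ranked:
    print(f"{ranks[name]}. {name}: {coef:+.4f}")
```

Under this reading, the sign of a coefficient indicates which class a higher entropy value pushes toward, while the rank reflects only how strongly the feature weighs in the decision.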
Importance coefficients and ranking of features in the LDA model.
| Feature | Coefficient | Importance rank |
|---|---|---|
| POS bigram | -22.7247 | 1 |
| POS trigram | 16.4629 | 2 |
| Wordform trigram | -12.2450 | 3 |
| Wordform bigram | 7.5781 | 4 |
| Wordform unigram | 4.3513 | 5 |
| Character | -3.6422 | 6 |
| POS unigram | -0.4390 | 7 |
Fig 1. SHAP values in the SVMs model.
Fig 2. Two-feature interaction effects in contour plots.
(where (a) is the interaction between POS bigram and POS trigram, (b) is the interaction between POS bigram and Wordform trigram, and (c) is the interaction between POS trigram and Wordform trigram).