Leah Cathryn Windsor, James Grayson Cupit, Alistair James Windsor.
Abstract
Corpus selection bias in international relations research presents an epistemological problem: How do we know what we know? Most social science research in the field of text analytics relies on English language corpora, biasing our ability to understand international phenomena. To address the issue of corpus selection bias, we introduce results that suggest that machine translation may be used to address non-English sources. We use human translation and machine translation (Google Translate) on a collection of aligned sentences from United Nations documents extracted from the Multi-UN corpus, analyzed with a "bag of words" analysis tool, Linguistic Inquiry and Word Count (LIWC). Overall, the LIWC indices proved relatively stable across machine and human translated sentences. We find that while there are statistically significant differences between the original and translated documents, the effect sizes are relatively small, especially when looking at psychological processes.
Year: 2019 PMID: 31747404 PMCID: PMC6867602 DOI: 10.1371/journal.pone.0224425
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Example of parallel machine translation.
Fig 2. Workflow.
Mean correlations of word proportions across LIWC categories and languages (columns give the language translated from).
| LIWC Category | Arabic | German | French | Russian | Mandarin | Mean |
|---|---|---|---|---|---|---|
| All | 0.831 | 0.814 | 0.822 | 0.843 | 0.783 | 0.820 |
| Summary | 0.863 | 0.833 | 0.856 | 0.906 | 0.761 | 0.844 |
| Linguistic Dim. | 0.729 | 0.728 | 0.769 | 0.788 | 0.651 | 0.733 |
| Other Grammar | 0.829 | 0.784 | 0.783 | 0.856 | 0.724 | 0.795 |
| Psych. Proc. | 0.862 | 0.838 | 0.832 | 0.887 | 0.836 | 0.851 |
| Punctuation | 0.787 | 0.813 | 0.837 | 0.614 | 0.728 | 0.771 |
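
As a rough illustration of what a correlation of word proportions measures, the sketch below scores aligned human- and machine-translated sentences against one bag-of-words category and correlates the two score lists. It is a sketch under stated assumptions, not the authors' pipeline: the category `TOY_CATEGORY`, the example sentences, the tokenizer, and the choice of Pearson's r are all hypothetical, and the real LIWC dictionary is proprietary and far larger.

```python
import math
import re

# Hypothetical toy category standing in for one LIWC variable.
TOY_CATEGORY = {"we", "our", "us"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def category_proportion(text, category):
    """Percentage of tokens in `text` that belong to `category`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in category)
    return 100.0 * hits / len(tokens)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up aligned sentences: same source sentence, two English translations.
human = [
    "We welcome the report of our committee.",
    "The council adopted the resolution.",
    "Our delegation asks the assembly to support us.",
]
machine = [
    "We welcome the report of the committee.",
    "The council adopted the resolution.",
    "Our delegation requests that the assembly support us.",
]

human_scores = [category_proportion(s, TOY_CATEGORY) for s in human]
machine_scores = [category_proportion(s, TOY_CATEGORY) for s in machine]
r = pearson(human_scores, machine_scores)
print(f"correlation for toy category: {r:.3f}")
```

Replacing the percentage with the raw hit count gives the word-count variant of the same comparison.
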
Mean correlations of word counts across LIWC categories and languages (columns give the language translated from).
| LIWC Category | Arabic | German | French | Russian | Mandarin | Mean |
|---|---|---|---|---|---|---|
| All | 0.851 | 0.837 | 0.844 | 0.860 | 0.804 | 0.832 |
| Summary | 0.863 | 0.833 | 0.856 | 0.906 | 0.761 | 0.844 |
| Linguistic Dim. | 0.770 | 0.784 | 0.799 | 0.823 | 0.693 | 0.774 |
| Other Grammar | 0.848 | 0.791 | 0.804 | 0.867 | 0.754 | 0.813 |
| Psychological Proc. | 0.874 | 0.855 | 0.850 | 0.894 | 0.852 | 0.865 |
| Punctuation | 0.826 | 0.823 | 0.870 | 0.669 | 0.740 | 0.786 |
The Summary row is identical to the one in the word-proportion table because its entries are not proportions.
LIWC variables with less than 0.8 mean correlation of word proportions.
| Category | Variable |
|---|---|
| Composite | analytic |
| Linguistic Dimension | pronoun, ppron, we, you, shehe, they, ipron, prep, auxverb, adverb |
| Other Grammar | verb, compare, interrog |
| Psychological Processes | sad [Affective processes/Negative Emotions], male [Social processes], discrep [Cognitive processes], see [Perceptual processes], hear [Perceptual processes], reward [Drives], focuspast [Time orientation], focuspresent [Time orientation], focusfuture [Time orientation], motion [Relativity], home [Personal concerns], nonflu [Informal language] |
| Punctuation | Period, semic |
Interpretation of effect sizes.
| Absolute effect size (lower bound) | Interpretation |
|---|---|
| ≥ 0 | Very small |
| ≥ 0.01 | Small |
| ≥ 0.2 | Medium |
| ≥ 0.5 | Large |
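
The thresholds above can be applied mechanically. The following is a minimal sketch, assuming that the sign of the effect size is ignored, that each lower bound is inclusive, and that each band ends where the next begins; the function name `interpret_effect_size` is hypothetical.

```python
def interpret_effect_size(value):
    """Map an effect size to the labels in the interpretation table.

    Assumptions: only the magnitude matters, lower bounds are inclusive,
    and each band ends where the next one begins.
    """
    magnitude = abs(value)
    if magnitude >= 0.5:
        return "Large"
    if magnitude >= 0.2:
        return "Medium"
    if magnitude >= 0.01:
        return "Small"
    return "Very small"


# Values taken from the effect-size table in this document.
for variable, language, effect in [
    ("wps", "Chinese", -0.586),
    ("pronoun", "French", 0.345),
    ("wps", "Arabic", 0.135),
    ("negate", "Arabic", 0.002),
]:
    print(variable, language, effect, interpret_effect_size(effect))
```

Under these assumptions, a variable counts as showing a medium or greater effect when its magnitude reaches 0.2 for at least one source language.
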
LIWC variables showing medium or greater effect size (columns give the language translated from).
| LIWC Variable | Arabic | German | French | Chinese |
|---|---|---|---|---|
| wps | 0.135 | -0.409 | -0.01 | -0.586 |
| dic | -0.014 | -0.036 | -0.016 | -0.305 |
| function | -0.05 | 0.049 | 0.095 | -0.598 |
| pronoun | -0.015 | 0.127 | 0.345 | -0.257 |
| you | 0.287 | 0.105 | 0 | 0.089 |
| ipron | -0.01 | 0.099 | 0.33 | -0.238 |
| prep | -0.117 | -0.131 | -0.124 | -0.701 |
| negate | 0.002 | -0.039 | 0.031 | 0.541 |
| time | 0.276 | 0 | 0.042 | 0.079 |

Effect size key: Very Small, Small, Medium, Large.