Ximena Gutierrez-Vasques, Victor Mijangos.
Abstract
We propose a quantitative, text-based approach for measuring the morphological complexity of a language. Several corpus-based methods have focused on measuring the different word forms that a language can produce. We take into account not only the productivity of morphological processes but also their predictability. We use a language model that predicts the probability of sub-word sequences within a word; we calculate the entropy rate of this model and use it as a measure of the predictability of the internal structure of words. Our results show that it is important to integrate these two dimensions when measuring morphological complexity, since languages can be complex under one measure but simpler under another. We calculated the complexity measures on two different parallel corpora for a typologically diverse set of languages. Our approach is corpus-based and does not require linguistically annotated data.
Keywords: TTR; entropy rate; language complexity; language model; morphology
Year: 2019 PMID: 33285823 PMCID: PMC7516478 DOI: 10.3390/e22010048
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
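The first dimension named in the abstract, the variety of word forms a language produces, is what the TTR keyword refers to. The following is a minimal sketch, assuming the common definition of TTR as the number of distinct word types divided by the number of tokens and plain whitespace tokenization; the paper's exact preprocessing is not given in this entry, and the toy sentences are hypothetical.

```python
def type_token_ratio(tokens):
    """Return (number of distinct types) / (number of tokens)."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


# Hypothetical toy inputs, not drawn from the Bibles or JW300 corpora.
few_forms = "the cat saw the cat and the dog saw the dog".split()
many_forms = "gatos gata gatitos gatito gatas gatita".split()

# A text that keeps reusing the same forms gets a low ratio; a text in
# which every token is a new form gets the maximum ratio of 1.
print(type_token_ratio(few_forms))
print(type_token_ratio(many_forms))
```

Under this definition, a language that packs more information into distinct inflected forms tends to score higher on the same parallel text, which is why the measure is read as morphological productivity.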
General information about the parallel corpora.
| Corpus | Languages Covered | Total Tokens | Avg. Tokens Per Language |
|---|---|---|---|
| Bibles | 47 | 1.1 M | 24.8 K |
| JW300 | 133 | 22.4 M | 168.9 K |
Toy example of a stochastic matrix using the trigrams contained in the word ‘cats’. The symbols # and $ indicate the beginning and end of a word, respectively.
| | #ca | cat | ats | ts$ |
|---|---|---|---|---|
| #ca | 0.01 | 0.06 | 0.07 | 0.33 |
| cat | 0.9 | 0.04 | 0.05 | 0.22 |
| ats | 0.06 | 0.78 | 0.05 | 0.23 |
| ts$ | 0.03 | 0.12 | 0.83 | 0.22 |
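To make the link between such a matrix and a predictability score concrete, here is a rough sketch that averages the entropy of the conditional distributions in the toy matrix. It rests on several assumptions that this entry does not confirm: each column is read as the distribution over the next trigram (the columns, not the rows, sum to 1), every conditioning trigram is weighted equally, and logarithms are taken in base N (the number of trigrams) so the result lies in [0, 1]. A textbook entropy rate would instead weight each state by its stationary probability.

```python
import math

# Values copied from the toy table above; STATES gives the row/column order.
STATES = ["#ca", "cat", "ats", "ts$"]
P = [
    [0.01, 0.06, 0.07, 0.33],
    [0.90, 0.04, 0.05, 0.22],
    [0.06, 0.78, 0.05, 0.23],
    [0.03, 0.12, 0.83, 0.22],
]


def column(matrix, j):
    """Return column j, read here as P(next trigram | trigram j) (assumption)."""
    return [row[j] for row in matrix]


def normalized_entropy(dist):
    """Shannon entropy with log base len(dist), so it lies in [0, 1]."""
    n = len(dist)
    return -sum(p * math.log(p, n) for p in dist if p > 0)


def average_conditional_entropy(matrix):
    """Uniformly weighted mean of the per-state entropies (assumption)."""
    n = len(matrix)
    return sum(normalized_entropy(column(matrix, j)) for j in range(n)) / n


h = average_conditional_entropy(P)
print(f"average normalized entropy: {h:.3f}")
```

Values near 0 mean the next trigram is almost determined by the current one; values near 1 mean it is close to uniformly random.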
Figure 1. Neural probabilistic language model architecture, where the modeled units are n-grams.
Complexity measures on the Bibles corpus ($H_1$: unigrams entropy; $H_3$: trigrams entropy; TTR: Type-token relationship); bold numbers indicate the highest and the lowest values for each measure, the rank is in brackets.
| Language | $H_1$ | $H_3$ | TTR | TTR+$H_1$ | TTR+$H_3$ | TTR+$H_1$+$H_3$ |
|---|---|---|---|---|---|---|
| Arabic | 0.726 (3) | 0.748 (4) | 0.31 (3) | 0.333 (2) | 0.333 (3) | 0.333 (2) |
| Burmese | 0.74 (2) | 0.823 (2) | | | | |
| Eastern Oromo | 0.652 (10) | 0.573 (22) | 0.196 (9) | 0.105 (7) | 0.065 (18) | 0.073 (12) |
| English | 0.703 (5) | 0.667 (10) | 0.082 (19) | 0.083 (11) | 0.069 (16) | 0.088 (10) |
| Fijian | 0.569 (19) | 0.519 (24) | 0.048 (24) | 0.047 (21) | | 0.045 (24) |
| Finnish | 0.696 (6) | 0.59 (20) | 0.266 (5) | 0.182 (5) | 0.08 (9) | 0.097 (8) |
| French | 0.607 (17) | 0.609 (18) | 0.139 (12) | 0.069 (16) | 0.067 (17) | 0.064 (18) |
| Georgian | 0.632 (12) | 0.67 (9) | 0.238 (6) | 0.105 (7) | 0.133 (5) | 0.107 (5) |
| German | 0.588 (18) | 0.664 (12) | 0.136 (13) | 0.065 (17) | 0.08 (9) | 0.07 (13) |
| Hausa | 0.61 (16) | 0.614 (17) | 0.098 (18) | 0.059 (19) | 0.057 (21) | 0.059 (21) |
| Hindi | 0.54 (22) | 0.729 (6) | 0.057 (22) | 0.045 (23) | 0.071 (13) | 0.06 (20) |
| Indonesian | 0.662 (9) | 0.599 (19) | 0.115 (17) | 0.077 (12) | 0.056 (22) | 0.067 (16) |
| Korean | | | 0.348 (2) | 0.074 (14) | 0.667 (1) | 0.107 (5) |
| Modern Greek | 0.683 (7) | 0.655 (14) | 0.181 (10) | 0.118 (6) | 0.083 (8) | 0.097 (8) |
| Malagasy (Plateau) | 0.568 (20) | | 0.14 (11) | 0.065 (17) | 0.056 (22) | 0.054 (23) |
| Russian | | 0.732 (5) | 0.225 (8) | 0.222 (4) | 0.154 (4) | 0.214 (3) |
| Sango | 0.538 (23) | 0.56 (23) | | | | |
| Spanish | 0.647 (11) | 0.656 (13) | 0.133 (15) | 0.077 (12) | 0.071 (13) | 0.077 (11) |
| Swahili | 0.613 (14) | 0.576 (21) | 0.233 (7) | 0.091 (9) | 0.071 (13) | 0.07 (13) |
| Tagalog | 0.632 (12) | 0.629 (16) | 0.121 (16) | 0.071 (15) | 0.063 (19) | 0.068 (15) |
| Thai | 0.554 (21) | 0.752 (3) | 0.055 (23) | 0.045 (23) | 0.074 (11) | 0.063 (19) |
| Turkish | 0.705 (4) | 0.63 (15) | 0.297 (4) | 0.25 (3) | 0.105 (6) | 0.13 (4) |
| Vietnamese | 0.406 (24) | 0.684 (8) | 0.066 (20) | 0.045 (23) | 0.071 (13) | 0.058 (22) |
| Western Farsi | 0.67 (8) | 0.705 (7) | 0.135 (14) | 0.091 (9) | 0.095 (7) | 0.103 (7) |
| Yoruba | 0.613 (14) | 0.666 (11) | 0.064 (21) | 0.057 (20) | 0.062 (20) | 0.065 (17) |
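The last three columns of the table above, which combine TTR with the entropy measures, appear consistent with the reciprocal of the mean of the individual ranks. This relationship is inferred from the tabulated values rather than stated in this entry, so the sketch below should be read as a plausibility check; the function name `combined_score` is hypothetical.

```python
def combined_score(*ranks):
    """Reciprocal of the mean rank (inferred relationship, see lead-in)."""
    return 1.0 / (sum(ranks) / len(ranks))


# Ranks read from the English row of the Bibles table above:
# first entropy column 5, second entropy column 10, TTR 19.
english_ttr_h1 = combined_score(19, 5)
english_ttr_h3 = combined_score(19, 10)
english_all = combined_score(19, 5, 10)

# The English row lists 0.083, 0.069 and 0.088 for these three columns.
print(round(english_ttr_h1, 3), round(english_ttr_h3, 3), round(english_all, 3))

# Ranks read from the Turkish row: 4, 15, and TTR 4.
print(round(combined_score(4, 4), 3),
      round(combined_score(4, 15), 3),
      round(combined_score(4, 4, 15), 3))
```

Because the score depends only on ranks, a language that ranks high on one measure and low on another lands in the middle of the combined ordering, which matches the abstract's point about integrating both dimensions.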
Complexity measures on the JW300 corpus ($H_1$: unigrams entropy; $H_3$: trigrams entropy; TTR: Type-token relationship); bold numbers indicate the highest and the lowest values for each measure, the rank is in brackets.
| Language | $H_1$ | $H_3$ | TTR | TTR+$H_1$ | TTR+$H_3$ | TTR+$H_1$+$H_3$ |
|---|---|---|---|---|---|---|
| Arabic | 0.586 (8) | 0.826 (2) | 0.171 (4) | 0.166 (4) | | |
| Burmese | 0.514 (19) | 0.75 (5) | 0.016 (22) | 0.048 (23) | 0.074 (12) | 0.065 (17) |
| Eastern Oromo | 0.552 (14) | 0.568 (23) | 0.111 (6) | 0.1 (10) | 0.068 (16) | 0.069 (15) |
| English | 0.682 (2) | 0.712 (12) | 0.053 (16) | 0.111 (9) | 0.071 (14) | 0.1 (7) |
| Fijian | 0.517 (18) | 0.66 (17) | 0.022 (21) | 0.051 (21) | 0.052 (23) | 0.053 (21) |
| Finnish | 0.563 (10) | 0.628 (20) | | 0.181 (2) | 0.095 (6) | 0.096 (9) |
| French | 0.522 (17) | 0.673 (16) | 0.072 (11) | 0.071 (14) | 0.074 (12) | 0.068 (16) |
| Georgian | 0.563 (10) | 0.728 (9) | 0.175 (2) | 0.153 (6) | 0.181 (2) | 0.136 (3) |
| German | 0.636 (3) | 0.686 (14) | 0.084 (9) | 0.166 (4) | 0.086 (9) | 0.115 (5) |
| Hausa | 0.527 (16) | 0.619 (22) | 0.035 (18) | 0.058 (17) | | 0.053 (21) |
| Hindi | 0.591 (6) | 0.783 (3) | 0.023 (19) | 0.076 (12) | 0.086 (9) | 0.103 (6) |
| Indonesian | 0.556 (12) | 0.624 (21) | 0.051 (17) | 0.068 (15) | 0.052 (23) | 0.06 (19) |
| Korean | 0.349 (24) | | 0.057 (14) | 0.052 (20) | 0.133 (4) | 0.076 (14) |
| Modern Greek | 0.594 (5) | 0.753 (4) | 0.09 (8) | 0.153 (6) | 0.166 (3) | 0.176 (2) |
| Malagasy (Plateau) | 0.499 (22) | | 0.062 (12) | 0.058 (17) | 0.054 (22) | 0.05 (24) |
| Russian | 0.5 (21) | 0.722 (11) | 0.137 (5) | 0.076 (12) | 0.125 (5) | 0.081 (12) |
| Sango | 0.385 (23) | 0.724 (10) | | | 0.057 (20) | 0.051 (23) |
| Spanish | 0.59 (7) | 0.65 (18) | 0.079 (10) | 0.117 (8) | 0.071 (14) | 0.085 (10) |
| Swahili | 0.598 (4) | 0.565 (24) | 0.098 (7) | 0.181 (2) | 0.064 (18) | 0.085 (10) |
| Tagalog | 0.514 (19) | 0.676 (15) | 0.054 (15) | 0.057 (19) | 0.066 (17) | 0.06 (19) |
| Thai | 0.552 (14) | 0.74 (7) | 0.013 (24) | 0.051 (21) | 0.064 (18) | 0.065 (17) |
| Turkish | | 0.65 (18) | 0.175 (2) | | 0.09 (8) | 0.13 (4) |
| Vietnamese | | 0.692 (13) | 0.014 (23) | | 0.055 (21) | |
| Western Farsi | 0.569 (9) | 0.738 (8) | 0.061 (13) | 0.09 (11) | 0.095 (6) | 0.1 (7) |
| Yoruba | 0.553 (13) | 0.748 (6) | 0.023 (19) | 0.062 (16) | 0.08 (11) | 0.078 (13) |
Figure 2. Different complexity measures (above) and their combinations (below) from the Bibles corpus.
Figure 3. Different complexity measures (above) and their combinations (below) from the JW300 corpus.
Correlation of complexities between the JW300 and Bibles corpora ($H_1$: unigrams entropy; $H_3$: trigrams entropy; TTR: Type-token relationship).
| $H_1$ | $H_3$ | TTR | TTR+$H_1$ | TTR+$H_3$ | TTR+$H_1$+$H_3$ |
|---|---|---|---|---|---|
| 0.520 | 0.782 | 0.890 | 0.776 | 0.858 | 0.765 |
Spearman’s correlations between measures in the JW300 corpus (all languages considered) ($H_1$: unigrams entropy; $H_3$: trigrams entropy; TTR: Type-token relationship).
| | $H_1$ | $H_3$ | TTR | TTR+$H_1$ | TTR+$H_3$ | TTR+$H_1$+$H_3$ |
|---|---|---|---|---|---|---|
| $H_1$ | 1.0 | 0.271 | 0.423 | 0.839 | 0.471 | 0.788 |
| $H_3$ | - | 1.0 | 0.112 | 0.238 | 0.746 | 0.64 |
| TTR | - | - | 1.0 | 0.843 | 0.732 | 0.709 |
| TTR+$H_1$ | - | - | - | 1.0 | 0.72 | 0.892 |
| TTR+$H_3$ | - | - | - | - | 1.0 | 0.909 |
| TTR+$H_1$+$H_3$ | - | - | - | - | - | 1.0 |
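For readers who want to see how a Spearman coefficient of this kind is obtained, the sketch below ranks two lists of scores (averaging ranks over ties) and takes the Pearson correlation of the ranks. The helper names are hypothetical. The five input pairs are the first entropy column and the TTR column for English, Finnish, Turkish, Vietnamese and Korean, copied from the all-languages JW300 table in this entry. A five-language subset is not expected to reproduce the coefficients reported over all languages.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving ties their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Order: English, Finnish, Turkish, Vietnamese, Korean (JW300, all languages).
entropy_unigram = [0.682, 0.563, 0.684, 0.344, 0.349]
ttr = [0.053, 0.184, 0.175, 0.014, 0.057]

rho = spearman(entropy_unigram, ttr)
print(f"Spearman's rho on the five-language subset: {rho:.3f}")
```

Working on ranks rather than raw values makes the coefficient insensitive to the different scales of the entropy and TTR columns.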
Spearman’s correlations between measures in the Bibles corpus (all languages considered) ($H_1$: unigrams entropy; $H_3$: trigrams entropy; TTR: Type-token relationship).
| | $H_1$ | $H_3$ | TTR | TTR+$H_1$ | TTR+$H_3$ | TTR+$H_1$+$H_3$ |
|---|---|---|---|---|---|---|
| $H_1$ | 1.0 | 0.276 | 0.384 | 0.828 | 0.464 | 0.810 |
| $H_3$ | - | 1.0 | 0.006 | 0.152 | 0.693 | 0.585 |
| TTR | - | - | 1.0 | 0.815 | 0.654 | 0.637 |
| TTR+$H_1$ | - | - | - | 1.0 | 0.668 | 0.866 |
| TTR+$H_3$ | - | - | - | - | 1.0 | 0.862 |
| TTR+$H_1$+$H_3$ | - | - | - | - | - | 1.0 |
Spearman’s correlation between , and our complexity measures (H: unigrams entropy; H: trigrams entropy; TTR: Type-token relationship).
| $H_1$ | $H_3$ | TTR | TTR+$H_1$ | TTR+$H_3$ | TTR+$H_1$+$H_3$ |
|---|---|---|---|---|---|
| 0.322 | −0.392 | 0.882 | 0.730 | 0.395 | 0.406 |
| 0.064 | 0.024 | 0.851 | 0.442 | 0.585 | 0.366 |
Complexity measures on the JW300 corpus (for all languages).
| Language | $H_1$ | $H_3$ | TTR | TTR+$H_1$ | TTR+$H_3$ | TTR+$H_1$+$H_3$ |
|---|---|---|---|---|---|---|
| Afrikaans | 0.566 (73) | 0.674 (69) | 0.047 (82) | 0.013 (79) | 0.013 (76) | 0.013 (79) |
| Amharic | 0.582 (56) | 0.875 (4) | 0.2 (8) | 0.031 (22) | 0.167 (1) | 0.044 (8) |
| Arabic | 0.586 (53) | 0.827 (6) | 0.171 (15) | 0.029 (25) | 0.095 (4) | 0.041 (12) |
| Azerbaijani | 0.661 (6) | 0.728 (32) | 0.151 (21) | 0.074 (5) | 0.038 (15) | 0.051 (5) |
| Bicol | 0.622 (18) | 0.69 (57) | 0.049 (79) | 0.021 (44) | 0.015 (63) | 0.019 (40) |
| Cibemba | 0.527 (107) | 0.581 (115) | 0.108 (39) | 0.014 (73) | 0.013 (79) | 0.011 (99) |
| Bulgarian | 0.56 (80) | 0.68 (66) | 0.091 (45) | 0.016 (63) | 0.018 (46) | 0.016 (62) |
| Bislama | 0.548 (88) | 0.662 (75) | 0.009 (132) | 0.009 (120) | 0.01 (117) | 0.01 (112) |
| Bengali | 0.546 (90) | 0.801 (9) | 0.06 (69) | 0.013 (83) | 0.026 (26) | 0.018 (46) |
| Cebuano | 0.543 (93) | 0.708 (42) | 0.051 (75) | 0.012 (87) | 0.017 (54) | 0.014 (71) |
| Chuukese | 0.579 (58) | 0.618 (104) | 0.037 (90) | 0.014 (75) | 0.01 (107) | 0.012 (91) |
| Seychelles Creole | 0.593 (46) | 0.645 (87) | 0.024 (107) | 0.013 (77) | 0.01 (107) | 0.012 (85) |
| Czech | 0.668 (4) | 0.777 (13) | 0.125 (29) | 0.061 (10) | 0.048 (9) | 0.065 (4) |
| Danish | 0.617 (22) | 0.695 (53) | 0.063 (65) | 0.023 (36) | 0.017 (55) | 0.021 (34) |
| German | 0.636 (14) | 0.686 (62) | 0.084 (50) | 0.031 (22) | 0.018 (47) | 0.024 (29) |
| Ewe | 0.488 (124) | 0.717 (39) | 0.05 (77) | 0.01 (109) | 0.017 (53) | 0.012 (85) |
| Efik | 0.61 (30) | 0.657 (80) | 0.043 (85) | 0.017 (56) | 0.012 (94) | 0.015 (64) |
| Modern Greek | 0.594 (44) | 0.753 (19) | 0.09 (47) | 0.022 (40) | 0.03 (21) | 0.027 (22) |
| English | 0.682 (3) | 0.713 (41) | 0.053 (74) | 0.026 (29) | 0.017 (51) | 0.025 (26) |
| Spanish | 0.59 (48) | 0.65 (82) | 0.079 (54) | 0.02 (51) | 0.015 (63) | 0.016 (57) |
| Estonian | 0.623 (17) | 0.663 (74) | 0.155 (19) | 0.056 (12) | 0.022 (32) | 0.027 (22) |
| Western Farsi | 0.569 (71) | 0.739 (27) | 0.061 (68) | 0.014 (71) | 0.021 (34) | 0.018 (43) |
| Finnish | 0.563 (75) | 0.628 (96) | 0.184 (9) | 0.024 (34) | 0.019 (40) | 0.017 (52) |
| Fijian | 0.517 (111) | 0.66 (77) | 0.022 (115) | 0.009 (124) | 0.01 (105) | 0.01 (115) |
| French | 0.522 (110) | 0.674 (68) | 0.072 (62) | 0.012 (90) | 0.015 (59) | 0.012 (85) |
| Ga | 0.547 (89) | 0.664 (73) | 0.046 (83) | 0.012 (90) | 0.013 (83) | 0.012 (89) |
| Kiribati | 0.506 (118) | 0.592 (113) | 0.031 (101) | 0.009 (118) | 0.009 (125) | 0.009 (128) |
| Gujarati | 0.542 (95) | 0.835 (5) | 0.048 (81) | 0.011 (93) | 0.023 (28) | 0.017 (53) |
| Gun | 0.575 (65) | 0.691 (55) | 0.024 (108) | 0.012 (92) | 0.012 (91) | 0.013 (81) |
| Hausa | 0.527 (106) | 0.619 (102) | 0.035 (94) | 0.01 (107) | 0.01 (109) | 0.01 (114) |
| Hebrew | 0.595 (43) | 0.763 (17) | 0.17 (16) | 0.034 (19) | 0.061 (6) | 0.039 (13) |
| Hindi | 0.591 (47) | 0.783 (10) | 0.022 (111) | 0.013 (81) | 0.017 (57) | 0.018 (46) |
| Hiligaynon | 0.564 (74) | 0.699 (48) | 0.045 (84) | 0.013 (81) | 0.015 (62) | 0.015 (70) |
| Hiri Motu | 0.543 (94) | 0.604 (111) | 0.012 (128) | 0.009 (122) | 0.008 (131) | 0.009 (129) |
| Croatian | 0.63 (16) | 0.735 (30) | 0.109 (38) | 0.037 (15) | 0.029 (22) | 0.036 (16) |
| Haitian Creole | 0.552 (85) | 0.662 (76) | 0.022 (114) | 0.01 (106) | 0.011 (104) | 0.011 (105) |
| Hungarian | 0.694 (1) | 0.747 (22) | 0.172 (14) | 0.133 (2) | 0.056 (7) | 0.081 (3) |
| Armenian | 0.575 (64) | 0.736 (29) | 0.117 (33) | 0.021 (44) | 0.032 (19) | 0.024 (29) |
| Indonesian | 0.556 (82) | 0.624 (97) | 0.051 (76) | 0.013 (81) | 0.012 (98) | 0.012 (94) |
| Igbo | 0.576 (60) | 0.613 (107) | 0.032 (99) | 0.013 (83) | 0.01 (115) | 0.011 (101) |
| Iloko | 0.611 (29) | 0.64 (89) | 0.08 (53) | 0.024 (32) | 0.014 (71) | 0.018 (48) |
| Icelandic | 0.637 (11) | 0.704 (45) | 0.09 (46) | 0.035 (18) | 0.022 (30) | 0.029 (19) |
| Isoko | 0.569 (70) | 0.656 (81) | 0.02 (116) | 0.011 (99) | 0.01 (111) | 0.011 (102) |
| Italian | 0.595 (40) | 0.614 (106) | 0.082 (52) | 0.022 (41) | 0.013 (87) | 0.015 (65) |
| Japanese | 0.302 (133) | 0.914 (1) | 0.024 (106) | 0.008 (128) | 0.019 (42) | 0.012 (85) |
| Georgian | 0.563 (77) | 0.729 (31) | 0.175 (12) | 0.022 (38) | 0.047 (10) | 0.025 (27) |
| Kongo | 0.534 (100) | 0.619 (103) | 0.022 (112) | 0.009 (114) | 0.009 (127) | 0.01 (122) |
| Greenlandic | 0.538 (98) | 0.623 (99) | 0.335 (1) | 0.02 (47) | 0.02 (38) | 0.015 (65) |
| Cambodian | 0.509 (117) | 0.779 (12) | 0.011 (129) | 0.008 (129) | 0.014 (69) | 0.012 (96) |
| Kannada | 0.587 (52) | 0.754 (18) | 0.239 (3) | 0.036 (16) | 0.095 (4) | 0.041 (10) |
| Korean | 0.349 (131) | 0.907 (2) | 0.057 (71) | 0.01 (110) | 0.027 (24) | 0.015 (69) |
| Kikaonde | 0.553 (83) | 0.541 (127) | 0.087 (48) | 0.015 (68) | 0.011 (99) | 0.012 (96) |
| Kikongo | 0.486 (126) | 0.541 (128) | 0.079 (55) | 0.011 (94) | 0.011 (103) | 0.01 (118) |
| Kirghiz | 0.563 (76) | 0.695 (51) | 0.144 (24) | 0.02 (49) | 0.027 (25) | 0.02 (39) |
| Luganda | 0.601 (36) | 0.539 (129) | 0.14 (25) | 0.033 (20) | 0.013 (79) | 0.016 (61) |
| Lingala | 0.526 (108) | 0.633 (93) | 0.04 (88) | 0.01 (105) | 0.011 (101) | 0.01 (109) |
| Silozi | 0.539 (97) | 0.598 (112) | 0.033 (97) | 0.01 (103) | 0.01 (119) | 0.01 (116) |
| Lithuanian | 0.637 (13) | 0.706 (43) | 0.167 (17) | 0.067 (9) | 0.033 (18) | 0.041 (10) |
| Kiluba | 0.544 (92) | 0.56 (125) | 0.112 (35) | 0.016 (64) | 0.012 (89) | 0.012 (91) |
| Tshiluba | 0.489 (123) | 0.617 (105) | 0.074 (60) | 0.011 (96) | 0.012 (94) | 0.01 (107) |
| Luvale | 0.545 (91) | 0.525 (133) | 0.145 (23) | 0.018 (55) | 0.013 (83) | 0.012 (90) |
| Mizo | 0.595 (42) | 0.681 (65) | 0.04 (87) | 0.016 (67) | 0.013 (78) | 0.015 (63) |
| Latvian | 0.582 (57) | 0.745 (24) | 0.123 (32) | 0.022 (38) | 0.036 (16) | 0.027 (24) |
| Mauritian Creole | 0.583 (55) | 0.624 (98) | 0.019 (117) | 0.012 (90) | 0.009 (127) | 0.011 (103) |
| Plateau Malagasy | 0.499 (122) | 0.538 (131) | 0.062 (66) | 0.011 (100) | 0.01 (111) | 0.009 (124) |
| Marshallese | 0.587 (51) | 0.718 (38) | 0.022 (113) | 0.012 (86) | 0.013 (76) | 0.015 (67) |
| Macedonian | 0.571 (68) | 0.698 (49) | 0.083 (51) | 0.017 (58) | 0.02 (38) | 0.018 (46) |
| Malayalam | 0.607 (32) | 0.701 (47) | 0.272 (2) | 0.059 (11) | 0.041 (14) | 0.037 (15) |
| Moore | 0.561 (79) | 0.724 (34) | 0.027 (104) | 0.011 (96) | 0.014 (66) | 0.014 (75) |
| Marathi | 0.612 (27) | 0.738 (28) | 0.095 (44) | 0.028 (27) | 0.028 (23) | 0.03 (18) |
| Maltese | 0.616 (24) | 0.683 (63) | 0.075 (59) | 0.024 (33) | 0.016 (58) | 0.021 (36) |
| Burmese | 0.514 (113) | 0.75 (20) | 0.016 (121) | 0.009 (126) | 0.014 (69) | 0.012 (93) |
| Nepali | 0.524 (109) | 0.768 (15) | 0.096 (43) | 0.013 (76) | 0.034 (17) | 0.018 (44) |
| Niuean | 0.389 (129) | 0.646 (86) | 0.013 (125) | 0.008 (131) | 0.009 (122) | 0.009 (131) |
| Dutch | 0.604 (34) | 0.683 (64) | 0.061 (67) | 0.02 (50) | 0.015 (61) | 0.018 (42) |
| Norwegian | 0.605 (33) | 0.723 (35) | 0.056 (72) | 0.019 (53) | 0.019 (42) | 0.021 (34) |
| Sepedi | 0.514 (114) | 0.637 (90) | 0.037 (91) | 0.01 (111) | 0.011 (101) | 0.01 (112) |
| Chichewa | 0.567 (72) | 0.562 (124) | 0.124 (31) | 0.019 (52) | 0.013 (81) | 0.013 (80) |
| Eastern Oromo | 0.552 (86) | 0.568 (121) | 0.111 (36) | 0.016 (61) | 0.013 (85) | 0.012 (88) |
| Ossetian | 0.575 (63) | 0.688 (61) | 0.077 (57) | 0.017 (60) | 0.017 (55) | 0.017 (53) |
| Punjabi | 0.572 (66) | 0.816 (7) | 0.025 (105) | 0.012 (88) | 0.018 (47) | 0.017 (51) |
| Pangasinan | 0.612 (28) | 0.66 (78) | 0.058 (70) | 0.02 (46) | 0.014 (73) | 0.017 (49) |
| Papiamento (Curaçao) | 0.603 (35) | 0.704 (46) | 0.031 (102) | 0.015 (70) | 0.014 (73) | 0.016 (55) |
| Solomon Islands Pidgin | 0.642 (9) | 0.64 (88) | 0.013 (123) | 0.015 (69) | 0.009 (122) | 0.014 (76) |
| Polish | 0.617 (23) | 0.745 (23) | 0.152 (20) | 0.047 (13) | 0.047 (10) | 0.045 (6) |
| Ponapean | 0.533 (102) | 0.576 (118) | 0.032 (98) | 0.01 (107) | 0.009 (129) | 0.009 (123) |
| Portuguese | 0.595 (41) | 0.697 (50) | 0.075 (58) | 0.02 (47) | 0.019 (44) | 0.02 (38) |
| Romanian | 0.609 (31) | 0.695 (52) | 0.071 (63) | 0.021 (43) | 0.017 (51) | 0.021 (36) |
| Russian | 0.5 (121) | 0.722 (37) | 0.137 (26) | 0.014 (74) | 0.032 (20) | 0.016 (57) |
| Kirundi | 0.534 (101) | 0.636 (91) | 0.15 (22) | 0.016 (62) | 0.018 (49) | 0.014 (73) |
| Kinyarwanda | 0.599 (38) | 0.57 (120) | 0.134 (28) | 0.03 (24) | 0.014 (73) | 0.016 (59) |
| Sango | 0.385 (130) | 0.725 (33) | 0.01 (130) | 0.008 (133) | 0.012 (91) | 0.01 (111) |
| Sinhala | 0.578 (59) | 0.742 (25) | 0.079 (56) | 0.017 (56) | 0.025 (27) | 0.021 (34) |
| Slovak | 0.614 (26) | 0.767 (16) | 0.124 (30) | 0.036 (17) | 0.043 (12) | 0.042 (9) |
| Slovenian | 0.637 (12) | 0.69 (56) | 0.111 (37) | 0.041 (14) | 0.022 (32) | 0.029 (20) |
| Samoan | 0.536 (99) | 0.629 (95) | 0.017 (119) | 0.009 (117) | 0.009 (125) | 0.01 (121) |
| Shona | 0.622 (19) | 0.538 (130) | 0.18 (10) | 0.069 (7) | 0.014 (67) | 0.019 (41) |
| Albanian | 0.648 (8) | 0.723 (36) | 0.073 (61) | 0.029 (26) | 0.021 (37) | 0.029 (20) |
| Sranantongo | 0.54 (96) | 0.562 (123) | 0.01 (131) | 0.009 (125) | 0.008 (133) | 0.009 (132) |
| Sesotho (Lesotho) | 0.465 (128) | 0.58 (116) | 0.033 (95) | 0.009 (123) | 0.009 (122) | 0.009 (130) |
| Swedish | 0.621 (20) | 0.706 (44) | 0.066 (64) | 0.024 (34) | 0.019 (44) | 0.023 (31) |
| Swahili | 0.598 (39) | 0.566 (122) | 0.098 (41) | 0.025 (31) | 0.012 (93) | 0.015 (68) |
| Swahili (Congo) | 0.562 (78) | 0.586 (114) | 0.098 (41) | 0.017 (59) | 0.013 (82) | 0.013 (82) |
| Tamil | 0.618 (21) | 0.715 (40) | 0.234 (6) | 0.074 (5) | 0.043 (12) | 0.045 (7) |
| Telugu | 0.66 (7) | 0.811 (8) | 0.211 (7) | 0.143 (1) | 0.133 (2) | 0.136 (1) |
| Thai | 0.552 (87) | 0.74 (26) | 0.013 (124) | 0.009 (112) | 0.013 (75) | 0.013 (83) |
| Tigrinya | 0.666 (5) | 0.891 (3) | 0.162 (18) | 0.087 (4) | 0.095 (4) | 0.115 (2) |
| Tiv | 0.576 (61) | 0.659 (79) | 0.017 (120) | 0.011 (94) | 0.01 (113) | 0.012 (98) |
| Tagalog | 0.514 (115) | 0.676 (67) | 0.054 (73) | 0.011 (100) | 0.014 (67) | 0.012 (94) |
| Otetela | 0.529 (105) | 0.605 (110) | 0.085 (49) | 0.013 (78) | 0.013 (88) | 0.011 (100) |
| Setswana | 0.503 (120) | 0.612 (108) | 0.031 (100) | 0.009 (120) | 0.01 (118) | 0.009 (127) |
| Tongan | 0.532 (103) | 0.688 (60) | 0.023 (110) | 0.009 (115) | 0.012 (96) | 0.011 (104) |
| Chitonga | 0.558 (81) | 0.647 (85) | 0.177 (11) | 0.022 (41) | 0.021 (35) | 0.017 (50) |
| Tok Pisin | 0.575 (62) | 0.632 (94) | 0.008 (133) | 0.01 (104) | 0.009 (130) | 0.01 (109) |
| Turkish | 0.684 (2) | 0.65 (83) | 0.175 (13) | 0.133 (2) | 0.021 (35) | 0.031 (17) |
| Tsonga | 0.572 (67) | 0.571 (119) | 0.036 (93) | 0.012 (85) | 0.009 (124) | 0.011 (106) |
| Tatar | 0.593 (45) | 0.689 (58) | 0.116 (34) | 0.025 (30) | 0.022 (31) | 0.022 (32) |
| Chitumbuka | 0.588 (49) | 0.534 (132) | 0.108 (40) | 0.022 (38) | 0.012 (97) | 0.014 (78) |
| Twi | 0.469 (127) | 0.664 (72) | 0.039 (89) | 0.009 (116) | 0.012 (90) | 0.01 (107) |
| Tahitian | 0.487 (125) | 0.669 (70) | 0.012 (127) | 0.008 (130) | 0.01 (111) | 0.009 (126) |
| Ukrainian | 0.601 (37) | 0.775 (14) | 0.136 (27) | 0.031 (22) | 0.049 (8) | 0.038 (14) |
| Umbundu | 0.531 (104) | 0.56 (126) | 0.048 (80) | 0.011 (98) | 0.01 (115) | 0.01 (119) |
| Urdu | 0.631 (15) | 0.781 (11) | 0.033 (96) | 0.018 (54) | 0.019 (42) | 0.025 (28) |
| Venda | 0.512 (116) | 0.619 (101) | 0.031 (103) | 0.009 (118) | 0.01 (114) | 0.009 (125) |
| Vietnamese | 0.344 (132) | 0.692 (54) | 0.014 (122) | 0.008 (131) | 0.011 (100) | 0.01 (117) |
| Waray-Waray | 0.586 (54) | 0.665 (71) | 0.042 (86) | 0.014 (72) | 0.013 (85) | 0.014 (72) |
| Wallisian | 0.517 (112) | 0.577 (117) | 0.013 (126) | 0.008 (127) | 0.008 (132) | 0.008 (133) |
| Xhosa | 0.615 (25) | 0.647 (84) | 0.237 (4) | 0.069 (7) | 0.023 (29) | 0.027 (24) |
| Yapese | 0.639 (10) | 0.635 (92) | 0.018 (118) | 0.016 (65) | 0.01 (120) | 0.014 (76) |
| Yoruba | 0.553 (84) | 0.749 (21) | 0.023 (109) | 0.01 (102) | 0.015 (59) | 0.014 (73) |
| Maya | 0.587 (50) | 0.688 (59) | 0.05 (78) | 0.016 (65) | 0.015 (65) | 0.016 (60) |
| Zande | 0.505 (119) | 0.62 (100) | 0.037 (92) | 0.009 (112) | 0.01 (105) | 0.01 (120) |
| Zulu | 0.57 (69) | 0.609 (109) | 0.235 (5) | 0.027 (28) | 0.018 (50) | 0.016 (55) |
Complexity measures on the Bibles corpus (for all languages).
| Language | $H_1$ | $H_3$ | TTR | TTR+$H_1$ | TTR+$H_3$ | TTR+$H_1$+$H_3$ |
|---|---|---|---|---|---|---|
| Amele | 0.568 (37) | 0.59 (29) | 0.134 (26) | 0.031 (36) | 0.036 (34) | 0.032 (36) |
| Alamblak | 0.673 (11) | 0.643 (18) | 0.203 (15) | 0.076 (8) | 0.06 (8) | 0.068 (9) |
| Bukiyip | 0.651 (16) | 0.591 (28) | 0.119 (32) | 0.041 (25) | 0.033 (37) | 0.039 (28) |
| Apurinã | 0.592 (29) | 0.523 (43) | 0.205 (14) | 0.046 (19) | 0.035 (36) | 0.034 (33) |
| Mapudungun | 0.598 (27) | 0.596 (27) | 0.145 (20) | 0.041 (25) | 0.042 (20) | 0.04 (26) |
| Egyptian Arabic | 0.725 (5) | 0.748 (4) | 0.31 (4) | 0.222 (2) | 0.25 (3) | 0.23 (2) |
| Barasana-Eduria | 0.526 (45) | 0.577 (35) | 0.146 (19) | 0.031 (36) | 0.037 (31) | 0.03 (40) |
| Chamorro | 0.678 (10) | 0.663 (13) | 0.13 (29) | 0.051 (17) | 0.046 (16) | 0.056 (12) |
| German | 0.588 (30) | 0.663 (13) | 0.136 (24) | 0.037 (29) | 0.054 (12) | 0.044 (18) |
| Daga | 0.585 (32) | 0.545 (41) | 0.095 (39) | 0.028 (40) | 0.025 (44) | 0.026 (44) |
| Modern Greek | 0.683 (9) | 0.655 (16) | 0.181 (17) | 0.076 (8) | 0.06 (8) | 0.071 (7) |
| English | 0.703 (7) | 0.667 (10) | 0.082 (40) | 0.042 (22) | 0.04 (24) | 0.052 (13) |
| Basque | 0.655 (14) | 0.588 (31) | 0.224 (13) | 0.074 (10) | 0.045 (17) | 0.051 (15) |
| Fijian | 0.568 (37) | 0.519 (44) | 0.048 (46) | 0.024 (42) | 0.022 (47) | 0.023 (46) |
| Finnish | 0.696 (8) | 0.589 (30) | 0.266 (6) | 0.142 (5) | 0.055 (10) | 0.068 (9) |
| French | 0.606 (25) | 0.609 (24) | 0.139 (23) | 0.041 (25) | 0.042 (20) | 0.041 (23) |
| Paraguayan Guaraní | 0.613 (21) | 0.642 (19) | 0.174 (18) | 0.051 (17) | 0.054 (12) | 0.051 (15) |
| Eastern Oromo | 0.652 (15) | 0.573 (38) | 0.196 (16) | 0.064 (12) | 0.037 (31) | 0.043 (19) |
| Hausa | 0.609 (24) | 0.613 (23) | 0.098 (38) | 0.032 (32) | 0.032 (39) | 0.035 (30) |
| Hindi | 0.54 (43) | 0.729 (6) | 0.057 (43) | 0.023 (43) | 0.04 (24) | 0.032 (36) |
| Indonesian | 0.661 (13) | 0.598 (26) | 0.115 (34) | 0.042 (22) | 0.033 (37) | 0.041 (23) |
| Popti’ | 0.624 (20) | 0.646 (17) | 0.108 (37) | 0.035 (30) | 0.037 (31) | 0.04 (26) |
| Kalaallisut | 0.572 (35) | 0.455 (47) | 0.542 (2) | 0.054 (15) | 0.04 (24) | 0.035 (30) |
| Georgian | 0.632 (18) | 0.67 (9) | 0.238 (9) | 0.071 (11) | 0.111 (5) | 0.081 (5) |
| West Kewa | 0.573 (34) | 0.583 (33) | 0.113 (35) | 0.028 (40) | 0.029 (41) | 0.029 (41) |
| Halh Mongolian | 0.745 (3) | 0.601 (25) | 0.228 (11) | 0.142 (5) | 0.055 (10) | 0.076 (6) |
| Korean | 0.393 (47) | 0.861 (1) | 0.348 (3) | 0.04 (27) | 0.5 (2) | 0.058 (11) |
| Lango (Uganda) | 0.602 (26) | 0.558 (40) | 0.112 (36) | 0.032 (32) | 0.026 (43) | 0.029 (41) |
| San Miguel El Grande Mixtec | 0.57 (36) | 0.614 (22) | 0.125 (30) | 0.03 (39) | 0.038 (27) | 0.034 (33) |
| Burmese | 0.739 (4) | 0.822 (2) | 0.791 (1) | 0.4 (1) | 0.666 (1) | 0.428 (1) |
| Wichí Lhamtés Güisnay | 0.586 (31) | 0.585 (32) | 0.117 (33) | 0.031 (36) | 0.03 (40) | 0.031 (38) |
| Nama (Namibia) | 0.576 (33) | 0.665 (11) | 0.131 (28) | 0.032 (32) | 0.05 (14) | 0.041 (23) |
| Western Farsi | 0.67 (12) | 0.705 (7) | 0.135 (25) | 0.054 (15) | 0.062 (7) | 0.068 (9) |
| Plateau Malagasy | 0.567 (39) | 0.518 (45) | 0.14 (21) | 0.032 (32) | 0.029 (41) | 0.028 (43) |
| Imbabura Highland Quichua | 0.598 (27) | 0.492 (46) | 0.249 (8) | 0.057 (14) | 0.037 (31) | 0.037 (29) |
| Russian | 0.75 (1) | 0.732 (5) | 0.225 (12) | 0.153 (4) | 0.117 (4) | 0.166 (3) |
| Sango | 0.537 (44) | 0.56 (39) | 0.024 (47) | 0.021 (47) | 0.023 (46) | 0.023 (46) |
| Spanish | 0.647 (17) | 0.656 (15) | 0.133 (27) | 0.045 (20) | 0.047 (15) | 0.05 (17) |
| Swahili | 0.612 (22) | 0.575 (36) | 0.233 (10) | 0.06 (13) | 0.043 (18) | 0.043 (19) |
| Tagalog | 0.632 (18) | 0.629 (20) | 0.121 (31) | 0.04 (27) | 0.038 (27) | 0.042 (21) |
| Thai | 0.554 (41) | 0.752 (3) | 0.055 (44) | 0.023 (43) | 0.042 (20) | 0.034 (33) |
| Turkish | 0.705 (6) | 0.629 (20) | 0.297 (5) | 0.181 (3) | 0.08 (6) | 0.096 (4) |
| Vietnamese | 0.406 (46) | 0.684 (8) | 0.066 (41) | 0.022 (45) | 0.04 (24) | 0.031 (38) |
| Sanumá | 0.546 (42) | 0.574 (37) | 0.05 (45) | 0.022 (45) | 0.024 (45) | 0.024 (45) |
| Yagua | 0.563 (40) | 0.524 (42) | 0.266 (6) | 0.042 (22) | 0.04 (24) | 0.033 (35) |
| Yaqui | 0.748 (2) | 0.579 (34) | 0.14 (21) | 0.086 (7) | 0.036 (34) | 0.052 (13) |
| Yoruba | 0.612 (22) | 0.665 (11) | 0.064 (42) | 0.031 (36) | 0.037 (31) | 0.04 (26) |
complexity for the subset of languages shared with the Bibles corpus.
| Language | | $H_1$ | $H_3$ | TTR | TTR+$H_1$ | TTR+$H_3$ | TTR+$H_1$+$H_3$ |
|---|---|---|---|---|---|---|---|
| Amele | 0.456 (9) | 0.568 (17) | 0.59 (13) | 0.134 (13) | 0.066 (17) | 0.076 (16) | 0.069 (18) |
| Apurinã | 0.573 (5) | 0.592 (15) | 0.523 (17) | 0.205 (8) | 0.087 (12) | 0.08 (14) | 0.075 (16) |
| Basque | 0.647 (4) | 0.655 (8) | 0.588 (14) | 0.224 (7) | 0.133 (5) | 0.095 (9) | 0.103 (7) |
| Eastern Oromo | 0.487 (8) | 0.652 (9) | 0.573 (16) | 0.196 (9) | 0.111 (9) | 0.08 (14) | 0.088 (11) |
| Egyptian Arabic | 0.563 (6) | 0.725 (3) | 0.748 (1) | 0.31 (1) | 0.5 (1) | 1.0 (1) | 0.6 (1) |
| English | 0.329 (15) | 0.703 (5) | 0.667 (4) | 0.082 (17) | 0.09 (10) | 0.095 (9) | 0.115 (6) |
| German | 0.397 (13) | 0.588 (16) | 0.663 (6) | 0.136 (12) | 0.071 (14) | 0.111 (5) | 0.088 (11) |
| Halh Mongolian | 0.516 (7) | 0.745 (2) | 0.601 (11) | 0.228 (5) | 0.285 (3) | 0.125 (4) | 0.166 (4) |
| Hausa | 0.322 (16) | 0.609 (13) | 0.613 (10) | 0.098 (16) | 0.069 (15) | 0.076 (16) | 0.076 (15) |
| Imbabura Quichua | 0.662 (3) | 0.599 (14) | 0.492 (19) | 0.25 (3) | 0.117 (8) | 0.09 (12) | 0.083 (14) |
| Indonesian | 0.336 (14) | 0.661 (7) | 0.598 (12) | 0.115 (15) | 0.09 (10) | 0.074 (18) | 0.088 (11) |
| Modern Greek | 0.452 (11) | 0.683 (6) | 0.655 (8) | 0.181 (10) | 0.125 (6) | 0.111 (5) | 0.125 (5) |
| Plateau Malagasy | 0.309 (17) | 0.567 (18) | 0.518 (18) | 0.14 (11) | 0.069 (15) | 0.069 (19) | 0.063 (19) |
| Russian | 0.453 (10) | 0.751 (1) | 0.732 (2) | 0.225 (6) | 0.285 (3) | 0.25 (2) | 0.333 (2) |
| Spanish | 0.44 (12) | 0.647 (10) | 0.656 (7) | 0.133 (14) | 0.083 (13) | 0.095 (9) | 0.096 (8) |
| Swahili | 0.675 (2) | 0.612 (11) | 0.575 (15) | 0.233 (4) | 0.125 (6) | 0.105 (7) | 0.096 (8) |
| Turkish | 0.775 (1) | 0.705 (4) | 0.629 (9) | 0.297 (2) | 0.333 (2) | 0.181 (3) | 0.2 (3) |
| Vietnamese | 0.141 (19) | 0.406 (19) | 0.684 (3) | 0.066 (18) | 0.054 (19) | 0.095 (9) | 0.075 (16) |
| Yoruba | 0.178 (18) | 0.612 (11) | 0.665 (5) | 0.064 (19) | 0.066 (17) | 0.083 (13) | 0.085 (13) |
MCC complexity for the subset of languages shared with the JW300 corpus.
| Language | MCC | $H_1$ | $H_3$ | TTR | TTR+$H_1$ | TTR+$H_3$ | TTR+$H_1$+$H_3$ |
|---|---|---|---|---|---|---|---|
| Bulgarian | 96.0 (7) | 0.56 (20) | 0.68 (16) | 0.091 (10) | 0.016 (20) | 0.018 (13) | 0.016 (18) |
| Czech | 195.0 (2) | 0.668 (3) | 0.777 (1) | 0.125 (6) | 0.061 (3) | 0.048 (2) | 0.065 (2) |
| Danish | 15.0 (20) | 0.617 (9) | 0.695 (11) | 0.063 (19) | 0.023 (12) | 0.017 (16) | 0.021 (13) |
| Dutch | 26.0 (19) | 0.604 (13) | 0.683 (15) | 0.061 (20) | 0.02 (18) | 0.015 (19) | 0.018 (16) |
| English | 6.0 (21) | 0.682 (2) | 0.713 (7) | 0.053 (21) | 0.026 (9) | 0.017 (16) | 0.025 (10) |
| Estonian | 110.0 (5) | 0.623 (7) | 0.663 (18) | 0.155 (4) | 0.056 (4) | 0.022 (8) | 0.027 (8) |
| Finnish | 198.0 (1) | 0.563 (19) | 0.628 (20) | 0.184 (1) | 0.024 (10) | 0.019 (11) | 0.017 (17) |
| French | 30.0 (18) | 0.522 (21) | 0.674 (17) | 0.072 (16) | 0.012 (21) | 0.015 (19) | 0.012 (21) |
| German | 38.0 (16) | 0.636 (6) | 0.686 (14) | 0.084 (12) | 0.031 (8) | 0.018 (13) | 0.024 (11) |
| Hungarian | 94.0 (8) | 0.694 (1) | 0.747 (4) | 0.172 (2) | 0.133 (1) | 0.056 (1) | 0.081 (1) |
| Italian | 52.0 (13) | 0.595 (14) | 0.614 (21) | 0.082 (13) | 0.022 (14) | 0.013 (21) | 0.015 (20) |
| Latvian | 81.0 (9) | 0.582 (18) | 0.745 (5) | 0.123 (8) | 0.022 (14) | 0.036 (5) | 0.027 (8) |
| Lithuanian | 152.0 (3) | 0.637 (4) | 0.706 (8) | 0.167 (3) | 0.067 (2) | 0.033 (6) | 0.041 (5) |
| Modern Greek | 50.0 (14) | 0.594 (16) | 0.753 (3) | 0.09 (11) | 0.022 (14) | 0.03 (7) | 0.027 (8) |
| Polish | 112.0 (4) | 0.617 (9) | 0.745 (5) | 0.152 (5) | 0.047 (5) | 0.047 (3) | 0.045 (3) |
| Portuguese | 77.0 (10) | 0.595 (14) | 0.697 (10) | 0.075 (15) | 0.02 (18) | 0.019 (11) | 0.02 (15) |
| Romanian | 60.0 (12) | 0.609 (12) | 0.695 (11) | 0.071 (17) | 0.021 (16) | 0.017 (16) | 0.021 (13) |
| Slovak | 40.0 (15) | 0.614 (11) | 0.767 (2) | 0.124 (7) | 0.036 (7) | 0.043 (4) | 0.042 (4) |
| Slovenian | 100.0 (6) | 0.637 (4) | 0.69 (13) | 0.111 (9) | 0.041 (6) | 0.022 (8) | 0.029 (6) |
| Spanish | 71.0 (11) | 0.59 (17) | 0.65 (19) | 0.079 (14) | 0.02 (18) | 0.015 (19) | 0.016 (18) |
| Swedish | 35.0 (17) | 0.621 (8) | 0.706 (8) | 0.066 (18) | 0.024 (10) | 0.019 (11) | 0.023 (12) |
Spearman’s correlation between complexity measures in concatenative and isolating languages (Bibles corpus).
| | $H_1$ | $H_3$ | TTR |
|---|---|---|---|
| $H_1$ | 1.0 | 0.233 | 0.618 |
| $H_3$ | - | 1.0 | −0.121 |
| TTR | - | - | 1.0 |
| | | | |
| $H_1$ | 1.0 | −0.355 | 0.513 |
| $H_3$ | - | 1.0 | −0.178 |
| TTR | - | - | 1.0 |
Spearman’s correlation between complexity measures in concatenative and isolating languages (JW300 corpus).
| | $H_1$ | $H_3$ | TTR |
|---|---|---|---|
| $H_1$ | 1.0 | −0.12 | 0.296 |
| $H_3$ | - | 1.0 | −0.369 |
| TTR | - | - | 1.0 |
| | | | |
| $H_1$ | 1.0 | −0.011 | 0.438 |
| $H_3$ | - | 1.0 | −0.741 |
| TTR | - | - | 1.0 |
Spearman’s correlation between complexity measures and the average length per word in the Bibles corpus.
| $H_1$ | $H_3$ | TTR | TTR+$H_1$ | TTR+$H_3$ | TTR+$H_1$+$H_3$ |
|---|---|---|---|---|---|
| 0.354 | −0.421 | 0.697 | 0.628 | 0.141 | 0.278 |
Spearman’s correlation between complexity measures and the average length per word in the JW300 corpus.
| $H_1$ | $H_3$ | TTR | TTR+$H_1$ | TTR+$H_3$ | TTR+$H_1$+$H_3$ |
|---|---|---|---|---|---|
| 0.296 | −0.359 | 0.735 | 0.606 | 0.265 | 0.315 |