Christoph Rzymski, Tiago Tresoldi, Simon J Greenhill, Mei-Shin Wu, Nathanael E Schweikhard, Maria Koptjevskaja-Tamm, Volker Gast, Timotheus A Bodt, Abbie Hantgan, Gereon A Kaiping, Sophie Chang, Yunfan Lai, Natalia Morozova, Heini Arjava, Nataliia Hübler, Ezequiel Koile, Steve Pepper, Mariann Proos, Briana Van Epps, Ingrid Blanco, Carolin Hundt, Sergei Monakhov, Kristina Pianykh, Sallona Ramesh, Russell D Gray, Robert Forkel, Johann-Mattis List.
Abstract
Advances in computer-assisted linguistic research have greatly reshaped the field. With the increasing availability of interconnected datasets created and curated by researchers, more and more interwoven questions can now be investigated. Such advances, however, place high demands on the rigor with which datasets are prepared and curated. Here we present CLICS, the Database of Cross-Linguistic Colexifications. CLICS tackles interconnected interdisciplinary research questions about the colexification of words across semantic categories in the world's languages, and showcases best practices for preparing data for cross-linguistic research. This is done by addressing shortcomings of an earlier version of the database, CLICS2, and by supplying an updated version, CLICS3, which massively increases the size and scope of the project. We provide tools and guidelines for this purpose and discuss insights resulting from organizing student tasks for database updates.
Year: 2020 PMID: 31932593 PMCID: PMC6957499 DOI: 10.1038/s41597-019-0341-x
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Example of a colexification network. The strong link between ARM and HAND indicates that in many languages both concepts are expressed with the same word. Weaker links can also be observed, among others between HAND and FIVE, explainable by the number of fingers on a hand, and between ELBOW and KNEE, explainable as both being joints.
Fig. 2 Raw data as a starting point for applying the data curation workflow. The table shows a screenshot of a snippet from the source of the yanglalo dataset.
Fig. 3 A diagram representing the six fundamental steps of a CLDF dataset preparation workflow.
Fig. 4 A diagram representing the workflow for installing, preparing, and using CLICS.
The twenty most common colexifications in CLICS3, as output by the command clics colexifications.
| Concept A | Concept B | Families | Languages | Words |
|---|---|---|---|---|
| WOOD | TREE | 59 | 348 | 361 |
| MOON | MONTH | 57 | 324 | 327 |
| FINGERNAIL | CLAW | 55 | 236 | 243 |
| LEG | FOOT | 52 | 349 | 358 |
| KNIFE (FOR EATING) | KNIFE | 51 | 268 | 282 |
| SON-IN-LAW (OF MAN) | SON-IN-LAW (OF WOMAN) | 49 | 261 | 280 |
| SKIN | BARK | 49 | 209 | 213 |
| WORD | LANGUAGE | 49 | 148 | 149 |
| ARM | HAND | 48 | 294 | 300 |
| LISTEN | HEAR | 48 | 107 | 109 |
| MEAT | FLESH | 47 | 252 | 262 |
| DAUGHTER-IN-LAW (OF WOMAN) | DAUGHTER-IN-LAW (OF MAN) | 47 | 234 | 256 |
| SKIN | LEATHER | 46 | 236 | 258 |
| BLUE | GREEN | 46 | 195 | 204 |
| MALE (OF ANIMAL) | MALE (OF PERSON) | 45 | 145 | 163 |
| WOMAN | WIFE | 44 | 289 | 301 |
| DISH | PLATE | 44 | 155 | 170 |
| FEMALE (OF PERSON) | FEMALE (OF ANIMAL) | 44 | 146 | 154 |
| EARTH (SOIL) | LAND | 43 | 159 | 167 |
| PATH | ROAD | 43 | 133 | 153 |
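
How counts like those in the table above can arise is illustrated by the following minimal sketch. It is not the implementation behind the clics command: the word list, the language and family names, and the function `count_colexifications` are all hypothetical, and it assumes that "Words" counts the forms that express both concepts within a language.

```python
from collections import defaultdict
from itertools import combinations

# Toy word list: (language, family, concept, form).
# All names and forms are invented for illustration only.
WORDS = [
    ("lang_a", "fam_1", "ARM", "taka"),
    ("lang_a", "fam_1", "HAND", "taka"),
    ("lang_a", "fam_1", "TREE", "molu"),
    ("lang_b", "fam_1", "ARM", "taki"),
    ("lang_b", "fam_1", "HAND", "taki"),
    ("lang_c", "fam_2", "WOOD", "pesu"),
    ("lang_c", "fam_2", "TREE", "pesu"),
    ("lang_c", "fam_2", "ARM", "noru"),
    ("lang_c", "fam_2", "HAND", "noru"),
]


def count_colexifications(words):
    """Return {(concept_a, concept_b): (families, languages, words)}."""
    concepts_by_form = defaultdict(set)  # (language, form) -> concepts
    family_of = {}
    for language, family, concept, form in words:
        concepts_by_form[(language, form)].add(concept)
        family_of[language] = family

    stats = defaultdict(lambda: {"families": set(), "languages": set(), "words": 0})
    for (language, _form), concepts in concepts_by_form.items():
        # Every pair of concepts sharing one form in one language is a colexification.
        for pair in combinations(sorted(concepts), 2):
            entry = stats[pair]
            entry["families"].add(family_of[language])
            entry["languages"].add(language)
            entry["words"] += 1

    return {
        pair: (len(s["families"]), len(s["languages"]), s["words"])
        for pair, s in stats.items()
    }


result = count_colexifications(WORDS)
for pair, counts in sorted(result.items()):
    print(pair, counts)
```

Counting families separately from languages matters because a colexification shared by many closely related languages may reflect common inheritance rather than an independent pattern.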
Fig. 5 Colexification clusters in CLICS3.
Datasets included in CLICS3, along with individual counts for glosses (“Glosses”), concepts mapped to Concepticon (“Concepts”), language varieties (“Varieties”), language varieties mapped to Glottolog (“Glottocodes”), and language families (“Families”); new datasets included for the CLICS3 release are also indicated. Each dataset was published as an independent work on Zenodo, as per the respective citations.
| # | Dataset | Glosses | Concepts | Varieties | Glottocodes | Families | New |
|---|---|---|---|---|---|---|---|
| 1 | abrahammonpa | 304 | 304 | 30 | 16 | 2 | Yes |
| 2 | allenbai | 499 | 499 | 9 | 9 | 1 | |
| 3 | bantubvd | 420 | 415 | 10 | 10 | 1 | |
| 4 | beidasinitic | 736 | 735 | 18 | 18 | 1 | |
| 5 | bodtkhobwa | 553 | 536 | 8 | 8 | 1 | Yes |
| 6 | bowernpny | 338 | 338 | 175 | 172 | 1 | |
| 7 | castrosui | 510 | 508 | 16 | 3 | 1 | Yes |
| 8 | chenhmongmien | 793 | 793 | 22 | 20 | 1 | Yes |
| 9 | diacl | 537 | 537 | 371 | 351 | 25 | Yes |
| 10 | halenepal | 699 | 662 | 13 | 13 | 2 | Yes |
| 11 | hantganbangime | 299 | 299 | 22 | 22 | 5 | Yes |
| 12 | hubercolumbian | 346 | 345 | 69 | 65 | 16 | |
| 13 | ids | 1310 | 1308 | 320 | 275 | 60 | |
| 14 | kraftchadic | 433 | 428 | 66 | 59 | 2 | |
| 15 | lexirumah | 604 | 602 | 357 | 231 | 12 | Yes |
| 16 | logos | 707 | 707 | 5 | 5 | 1 | Yes |
| 17 | marrisonnaga | 580 | 572 | 40 | 39 | 1 | Yes |
| 18 | mitterhoferbena | 342 | 335 | 13 | 13 | 1 | Yes |
| 19 | naganorgyalrongic | 969 | 877 | 10 | 8 | 1 | Yes |
| 20 | northeuralex | 952 | 951 | 107 | 107 | 21 | |
| 21 | robinsonap | 391 | 391 | 13 | 13 | 1 | |
| 22 | satterthwaitetb | 418 | 418 | 18 | 18 | 1 | |
| 23 | sohartmannchin | 279 | 279 | 8 | 7 | 1 | Yes |
| 24 | suntb | 929 | 929 | 49 | 49 | 1 | |
| 25 | tls | 1140 | 811 | 126 | 107 | 1 | |
| 26 | transnewguineaorg | 904 | 865 | 1004 | 760 | 106 | Yes |
| 27 | tryonsolomon | 317 | 314 | 111 | 96 | 5 | |
| 28 | wold | 1459 | 1458 | 41 | 41 | 24 | |
| 29 | yanglalo | 875 | 869 | 7 | 7 | 1 | Yes |
| 30 | zgraggenmadang | 311 | 310 | 98 | 98 | 1 | |
| TOTAL | | | 2906 | 3156 | 2271 | 200 | |
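
The TOTAL row can be checked against the per-dataset counts with a short sketch. The lists below are copied from the table, and the variable names are illustrative. The Varieties total equals the plain column sum. The Glottocodes total is smaller than its column sum, which would be consistent with the same Glottolog language appearing in more than one dataset, although the table itself does not state how the total was computed.

```python
# Per-dataset counts copied from the table, in row order 1 to 30.
varieties = [
    30, 9, 10, 18, 8, 175, 16, 22, 371, 13,
    22, 69, 320, 66, 357, 5, 40, 13, 10, 107,
    13, 18, 8, 49, 126, 1004, 111, 41, 7, 98,
]
glottocodes = [
    16, 9, 10, 18, 8, 172, 3, 20, 351, 13,
    22, 65, 275, 59, 231, 5, 39, 13, 8, 107,
    13, 18, 7, 49, 107, 760, 96, 41, 7, 98,
]

# Totals as reported in the TOTAL row.
TOTAL_VARIETIES = 3156
TOTAL_GLOTTOCODES = 2271

varieties_sum = sum(varieties)
glottocodes_sum = sum(glottocodes)

print("Varieties: column sum", varieties_sum, "reported total", TOTAL_VARIETIES)
print("Glottocodes: column sum", glottocodes_sum, "reported total", TOTAL_GLOTTOCODES)
```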
Fig. 6 Increase in data points (values) for CLICS3.
Fig. 7 Distribution of language varieties in CLICS3.
| Measurement(s) | Meaning • word |
| Technology Type(s) | digital curation • computational modeling technique • Semantics |
| Factor Type(s) | source data • language |