Annika Tjuka, Robert Forkel, Johann-Mattis List.
Abstract
Psychologists and linguists collect various data on word and concept properties. In psychology, scholars have accumulated norms and ratings for a large number of words in languages with many speakers. In linguistics, scholars have accumulated cross-linguistic information about the relations between words and concepts. Until now, however, there have been no efforts to combine information from the two fields, which would allow comparison of psychological and linguistic properties across different languages. The Database of Cross-Linguistic Norms, Ratings, and Relations for Words and Concepts (NoRaRe) is the first attempt to close this gap. Building on a reference catalog that offers standardization of concepts used in historical and typological language comparison, it integrates data from psychology and linguistics, collected from 98 data sets, covering 65 unique properties for 40 languages. The database is curated with the help of manual, automated, and semi-automated workflows and uses a software API to control and access the data. The database can be accessed via a web application, via the software API, or from scripting languages. In this study, we present how the database is structured, how it can be extended, and how we control the quality of the data curation process. To illustrate its application, we present three case studies that test the validity of our approach, the accuracy of our workflows, and the integrative potential of the database. With regular version updates, the NoRaRe database has the potential to advance research in psychology and linguistics by offering researchers an integrated perspective on both fields.
Keywords: Cross-linguistic comparison; Interdisciplinary database; Linguistic data; Psycholinguistic norms; Ratings; Test-driven data curation; Word and concept properties
Year: 2021 PMID: 34357536 PMCID: PMC9046307 DOI: 10.3758/s13428-021-01650-1
Source DB: PubMed Journal: Behav Res Methods ISSN: 1554-351X
Subset of the Concepticon database. The table gives information on the language for which the data was collected, the data type (tags), and the item number. The Concepticon currently includes 353 concept lists (Version 2.4.0, List et al., 2020a)
| Concept list | Language(s) | Tags | Items |
|---|---|---|---|
| Bodt and List ( | Kho-Bwa (Tibeto-Burman) | questionnaire, proto-language | 664 |
| Walworth and Shimelman ( | Vanuatu | basic, areal | 215 |
| Hale ( | Nepal | questionnaire, areal | 1798 |
| Bowern ( | Tasmanian | questionnaire | 644 |
| Swadesh ( | Global | basic | 100 |
| Chacon ( | Tukano | areal | 142 |
| Key and Comrie ( | Global | questionnaire | 1310 |
| Dunn et al., ( | Germanic | historical | 12 |
Glossary: Terms occurring in this article and their definitions
| Term | Definition |
|---|---|
| Word | The term ‘word’ refers to a meaningful unit in a particular language. We are aware of the difficulties in defining the term ‘word’ (Haspelmath, |
| Concept | ‘Concept’ is defined as a non-linguistic psychological representation of an object in the world. It includes the knowledge of existing entities and their properties (Murphy, |
| Word form | The term ‘word form’ is the form side of a word understood as a linguistic sign, as given in the orthography or as a sound sequence. |
| Elicitation gloss | An ‘elicitation gloss’ is used in linguistic fieldwork to denote a given concept in a language. Such glosses are established by linguists and are often based on already existing concept lists. Depending on the source language used for the elicitation, the gloss can be in English, Chinese, Spanish, or any other language. |
| Concepticon concept set | A ‘Concepticon concept set’ (also simply ‘concept set’) consists of a unique identifier, a label, a definition, a semantic field, and an ontological category. The concept identifier (e.g., “1”) is connected to a unique label (e.g., “ |
| Concept list | The term ‘concept list’ refers to a compilation of concepts in the form of elicitation glosses. Concept lists are used by linguists who want to elicit a concept in a particular language. In contrast to dictionaries, the lists are based on questionnaires and are compiled for language comparison or documentation (List, |
| Word list | A ‘word list’ is defined as a large collection of items that are evaluated for a particular property. The authors of these lists usually do not provide additional information on the intended meaning of a particular word. The list is stored in a tabular format in which a column corresponds to a property and a row represents an observation for a given word. |
| Data set | The term ‘data set’ is used when referring to a word list in addition to its metadata, i.e. the list, the scripts for mapping the list, and the raw data. |
| Word and concept properties | ‘Word and concept properties’ are variables that are collected in psychology and linguistics, including psycholinguistic measures and network relations, among others. |
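
The glossary entries for ‘Concepticon concept set’ and ‘word list’ describe simple record structures, which the following Python sketch tries to make concrete. It is an illustration only and not the NoRaRe or Concepticon software API: the class name `ConceptSet`, the field names, the helper `matched_rows`, and all values other than the identifier 906 and its label (taken from the Fig. 4 caption, with the capitalization assumed) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ConceptSet:
    """Hypothetical record with the five parts named in the glossary."""
    identifier: str
    label: str
    definition: str
    semantic_field: str
    ontological_category: str


# Identifier and label follow the Fig. 4 caption; the remaining
# fields are placeholders, not values from the Concepticon.
tree = ConceptSet(
    identifier="906",
    label="TREE",
    definition="(placeholder definition)",
    semantic_field="(placeholder semantic field)",
    ontological_category="(placeholder category)",
)

# A word list in tabular form: one row per word, one key per property.
# The words and frequency values are invented for illustration.
word_list: List[Dict[str, Optional[object]]] = [
    {"word": "tree", "concept_set": "906", "frequency": 120.5},
    {"word": "unmapped-word", "concept_set": None, "frequency": 0.1},
]


def matched_rows(rows):
    """Return only the rows that were linked to a concept set."""
    return [row for row in rows if row["concept_set"] is not None]


print(len(matched_rows(word_list)), "of", len(word_list), "rows matched")
```

In this reading, the ‘Matches’ reported for a data set would correspond to the number of rows that carry a concept set identifier after linking.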
Fig. 1 Best practice examples for structuring data sets. The data should be in a machine-readable form. Additionally, the information of the column content should be easily understandable by other researchers (for details on how to structure data in R, see Wickham, 2014)
Subset of the NoRaRe data. The table gives information on the language for which the data was collected, the data type, the original item number, and the number of matches to the Concepticon concept sets
| Data set | Language | Types | Items | Matches |
|---|---|---|---|---|
| Norms | | | | |
| Cai and Brysbaert ( | Chinese | frequency | 99,123 | 1644 |
| Ferrand et al., ( | French | reaction time | 38,840 | 1372 |
| Brysbaert et al., ( | German | frequency | 190,500 | 1291 |
| Cuetos et al., ( | Spanish | frequency | 94,338 | 1088 |
| Alonso et al., ( | Spanish | frequency | 67,979 | 1016 |
| Tsang et al., ( | Chinese | reaction time | 25,156 | 827 |
| Keuleers et al., ( | Dutch | frequency | 437,503 | 640 |
| González-Nosti et al., ( | Spanish | reaction time | 2765 | 554 |
| Mandera et al., ( | Polish | frequency | 377,843 | 215 |
| Ratings | | | | |
| Lynott et al., ( | English | sensorimotor | 40,000 | 2437 |
| Brysbaert et al., ( | English | prevalence | 62,000 | 2414 |
| Kuperman et al., ( | English | age-of-acquisition | 30,000 | 2351 |
| Stadthagen-González et al., ( | Spanish | valence, arousal | 14,031 | 932 |
| Moors et al., ( | Dutch | age-of-acquisition, affective* | 4300 | 444 |
| Łuniewska et al., ( | Diverse | age-of-acquisition | 299 | 284 |
| Verheyen et al., ( | Dutch | age-of-acquisition, lexicosemantic, distributional, affective, concreteness, imageability | 1000 | 206 |
| Imbir ( | Polish | age-of-acquisition, affective, concreteness, imageability | 4900 | 159 |
| Kapucu et al., ( | Turkish | discrete emotions, affective, concreteness | 2031 | 75 |
| Relations | | | | |
| Wu et al., ( | Global | core vocabulary | 10,000 | 2460 |
| Matisoff ( | Sino-Tibetan (Global) | etymology | 6431 | 2159 |
| Starostin ( | Diverse | sense relation | 7095 | 2020 |
| Rzymski et al., ( | Global | polysemy | 1624 | 1624 |
| Bond and Foster ( | English | WordNet | 4960 | 1309 |
| Dellert and Buch ( | Eurasian | basicness, stability | 1016 | 955 |
| Hill et al., ( | English | semantic similarity | 999 | 524 |
| Calude and Pagel ( | Diverse | stability, frequency | 200 | 200 |
| Baroni and Lenci ( | English | semantic similarity | 200 | 140 |
*The term ‘affective’ summarizes different variables such as arousal, valence, and dominance
Fig. 2 Workflows for data curation. a How raw data are converted to unified tabular data formats and consecutively labeled. b Details for the individual steps involved in the linking of the different data to Concepticon
Fig. 3 Comparing different kinds of data on word and concept properties as they have been proposed in the literature. a The Concepticon on top of the figure offers standardized concept sets for more than 3000 concepts. b The SUBTLEX data sets offer frequency counts for words across different languages based on subtitles (Brysbaert & New, 2009; Brysbaert et al., 2011). c User-rated collections of psychological categories, such as arousal, have been published for different languages (e.g., Riegel et al., 2015). d The CLICS database allows estimating the semantic similarity of concepts by measuring how often they are colexified in the languages of the world (Rzymski et al., 2020)
Fig. 4 A screenshot of the NoRaRe web application (https://digling.org/norare/) illustrating the values for the Concepticon concept set 906 tree across three different data sets (Bond and Foster, 2013; Alonso et al., 2015; Brysbaert & New, 2009)
Fig. 5 Distribution of Concepticon concept sets across the NoRaRe data sets. The x-axis gives the number of data sets in which the concept sets occur. The y-axis provides the number of Concepticon concept sets
The 15 most common Concepticon concept sets, each occurring in 64 to 74 NoRaRe data sets
| Rank | ID | Concept set | Data sets |
|---|---|---|---|
| 1 | 2009 | | 74 |
| 2 | 1248 | | 74 |
| 3 | 937 | | 73 |
| 4 | 1489 | | 72 |
| 5 | 227 | | 71 |
| 6 | 1343 | | 71 |
| 7 | 1430 | | 71 |
| 8 | 1223 | | 69 |
| 9 | 730 | | 67 |
| 10 | 906 | | 67 |
| 11 | 1221 | | 67 |
| 12 | 1394 | | 67 |
| 13 | 1247 | | 64 |
| 14 | 1297 | | 64 |
| 15 | 1335 | | 64 |
Pearson coefficients for the variables arousal, valence, and dominance (see text). The values in parentheses indicate the original numbers reported in Scott et al. (2019)
| Overlap | Arousal | Valence | Dominance |
|---|---|---|---|
| 1397 (4,073) | 0.57 (0.62) | 0.92 (0.93) | 0.66 (0.69) |
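
A comparison of this kind correlates two rating lists over the concept sets they share. The sketch below shows one way to do that in plain Python, assuming each list has already been reduced to a mapping from concept set identifier to mean rating. The function names and the toy ratings are hypothetical and do not reproduce the values in the table.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def correlate_on_overlap(ratings_a, ratings_b):
    """Return (overlap size, Pearson r) over shared concept set IDs."""
    shared = sorted(set(ratings_a) & set(ratings_b))
    r = pearson([ratings_a[k] for k in shared],
                [ratings_b[k] for k in shared])
    return len(shared), r


# Invented mean ratings keyed by concept set identifier.
ratings_a = {"906": 3.0, "1248": 5.0, "937": 4.0, "227": 2.0}
ratings_b = {"906": 2.5, "1248": 4.5, "937": 3.5, "9999": 1.0}

overlap, r = correlate_on_overlap(ratings_a, ratings_b)
print(overlap, round(r, 2))
```

Restricting both lists to the shared identifiers first is what makes the overlap count in the first column and the coefficients in the remaining columns refer to the same set of items.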
Fig. 6 Distribution of the mean values for (left) arousal, (middle) valence, and (right) dominance in Warriner et al. (2013) and Scott et al. (2019) for 1397 Concepticon concept sets
Pearson coefficients for the sensorimotor variables auditory, gustatory, haptic, olfactory, and visual (see text). Abbreviations: AUD auditory; GUS gustatory; HAP haptic; OLF olfactory; VIS visual
| Overlap | AUD | GUS | HAP | OLF | VIS |
|---|---|---|---|---|---|
| 314 | 0.81 | 0.86 | 0.85 | 0.87 | 0.74 |
Fig. 7 Distribution of the mean values for the five sensory modalities: a auditory, b gustatory, c haptic, d olfactory, and e visual in Lynott and Connell (2013), Lynott and Connell (2009), and Winter (2016) (manual workflow) and Lynott et al. (2020) (automated workflow) for 314 Concepticon concept sets
Pearson coefficients for pairwise comparison across languages for Multi-SimLex (Vulić et al., 2020), CLICS1 (List et al., 2013), CLICS2 (List et al., 2018), and CLICS3 (Rzymski et al., 2020)
| Overlap (pairs) | Multi-SimLex | CLICS1 | CLICS2 | CLICS3 |
|---|---|---|---|---|
| 252 | 0.68 | 0.34 | 0.25 | 0.27 |