Simon De Deyne, Danielle J. Navarro, Guillem Collell, Andrew Perfors.
Abstract
One of the main limitations of natural language-based approaches to meaning is that they do not incorporate multimodal representations the way humans do. In this study, we evaluate how well different kinds of models account for people's representations of both concrete and abstract concepts. The models we compare include unimodal distributional linguistic models as well as multimodal models which combine linguistic with perceptual or affective information. There are two types of linguistic models: those based on text corpora and those derived from word association data. We present two new studies and a reanalysis of a series of previous studies. The studies demonstrate that both visual and affective multimodal models better capture behavior that reflects human representations than unimodal linguistic models. The size of the multimodal advantage depends on the nature of the semantic representations involved, and it is especially pronounced for basic-level concepts that belong to the same superordinate category. Additional visual and affective features improve the accuracy of linguistic models based on text corpora more than those based on word associations; this suggests systematic qualitative differences between what information is encoded in natural language versus what information is reflected in word associations. Altogether, our work presents new evidence that multimodal information is important for capturing both abstract and concrete words and that fully representing word meaning requires more than purely linguistic information. Implications for both embodied and distributional views of semantic representation are discussed.
Keywords: Affect; Distributional semantics; Multimodal representations; Semantic networks; Visual features
Year: 2021 PMID: 33432630 PMCID: PMC7816238 DOI: 10.1111/cogs.12922
Source DB: PubMed Journal: Cogn Sci ISSN: 0364-0213
Fig. 1. Part of the WordNet hierarchy for concrete synsets that are covered by ImageNet. Synsets can occur at different hierarchical levels and are labeled by one or multiple words that can overlap with other branches in the tree; this is illustrated for the case “hedgehog.”
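For readers who want to inspect the hierarchy in Fig. 1 themselves, the sketch below shows how WordNet synsets, hypernym paths, and the most specific common hypernym (and its depth) can be retrieved with NLTK. The word choices are illustrative and are not taken from the paper's stimulus lists.

```python
# A minimal sketch (not from the paper) of inspecting the WordNet structure in Fig. 1
# with NLTK. Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# "hedgehog" labels synsets in more than one branch of the noun hierarchy.
for synset in wn.synsets("hedgehog", pos=wn.NOUN):
    path = synset.hypernym_paths()[0]               # one root-to-synset path
    print(synset.name(), synset.min_depth())
    print("  " + " -> ".join(s.name() for s in path))

# The most specific common hypernym of two concrete concepts, and its depth.
dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")
common = dog.lowest_common_hypernyms(cat)[0]
print(common.name(), common.min_depth())            # carnivore.n.01
```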
Concrete triad stimuli in Experiment 1
Abstract triad stimuli in Experiment 1. Category labels refer to the most specific common hypernym in WordNet found at depth [d]
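In the triad task used with these stimuli, participants choose which two of three words are most related; a model's prediction can be scored as the pair with the highest similarity under that model. The following is a minimal sketch with made-up toy vectors, not the models or stimuli used in the study.

```python
# A minimal sketch (hypothetical vectors) of scoring a model on the triad task:
# for three words, predict the pair with the highest cosine similarity.
from itertools import combinations
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict_triad(words, vectors):
    """Return the pair of words the model judges most related."""
    pairs = list(combinations(words, 2))
    sims = [cosine(vectors[a], vectors[b]) for a, b in pairs]
    return pairs[int(np.argmax(sims))]

# Toy 3-dimensional vectors purely for illustration.
vectors = {
    "violin": np.array([0.9, 0.1, 0.0]),
    "cello":  np.array([0.8, 0.2, 0.1]),
    "hammer": np.array([0.1, 0.9, 0.3]),
}
print(predict_triad(["violin", "cello", "hammer"], vectors))  # ('violin', 'cello')
```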
Pearson correlations and confidence intervals for unimodal and multimodal models. The top panel shows performance when the linguistic component is the distributional linguistic model; in the middle panel the linguistic component is the word association baseline. The bottom panel corresponds to a purely experiential model in which the affective model takes the place of the linguistic component and is added for completeness. In each panel, the unimodal columns show the performance of that linguistic model as well as of the experiential models on either the concrete or the abstract words. The best-fitting multimodal models, which combine the linguistic and experiential components, were found by optimizing the correlation over the mixing parameter β and are shown in the multimodal column. The improvement due to adding experiential information is shown in column Δr.
| Dataset | n | Linguistic r | CI95 | Experiential model | Experiential r | CI95 | β | Multimodal r | CI95 | Δr | CI95 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Concrete | 300 | .64 | [0.57, 0.70] | Visual | .67 | [0.60, 0.73] | .48 | .75 | [0.70, 0.80] | .12 | [0.07, 0.17] |
| Concrete | 300 | .64 | [0.57, 0.70] | Affect | .21 | [0.10, 0.32] | .50 | .68 | [0.62, 0.74] | .04 | [0.02, 0.08] |
| Abstract | 300 | .62 | [0.54, 0.68] | Affect | .51 | [0.43, 0.59] | .58 | .74 | [0.69, 0.79] | .13 | [0.08, 0.19] |
Note that the confidence intervals for Δr are based on testing significant differences between dependent overlapping correlations following Zou (2007). This approach increases the power to detect an effect compared to Fisher's r-to-z procedure, which assumes independence.
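The multimodal columns above combine a linguistic and an experiential similarity via the mixing parameter β, chosen to maximize the Pearson correlation with human judgments. The sketch below assumes a simple convex combination of pairwise similarity scores and a grid search over β; the paper's exact combination scheme may differ, and the toy data are invented.

```python
# A minimal sketch of fitting the mixing parameter beta, assuming the multimodal
# similarity is a convex combination of linguistic and experiential pair similarities.
import numpy as np
from scipy.stats import pearsonr

def best_beta(sim_ling, sim_exp, human, grid=np.linspace(0, 1, 101)):
    """Grid-search beta maximizing the Pearson correlation with human judgments."""
    best = (0.0, -1.0)
    for beta in grid:
        mixed = (1 - beta) * sim_ling + beta * sim_exp
        r, _ = pearsonr(mixed, human)
        if r > best[1]:
            best = (float(beta), float(r))
    return best  # (beta, r)

# Toy arrays of pairwise similarities and human ratings, for illustration only.
rng = np.random.default_rng(0)
sim_ling = rng.random(300)
sim_exp = 0.5 * sim_ling + 0.5 * rng.random(300)
human = 0.6 * sim_ling + 0.4 * sim_exp + 0.1 * rng.random(300)
print(best_beta(sim_ling, sim_exp, human))
```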
Fig. 2. The effect of adding visual or affective experiential information to predict triadic preferences for concrete (first and second panels) and abstract (third panel) word pairs. Each panel shows the unimodal distributional linguistic and word association correlations on the left side of the x‐axis and the unimodal experiential (affective or visual features) correlations on the right side. Intermediate values on the x‐axis indicate multimodal models. In the first panel, visual information is added: larger β values correspond to models that weight visual feature information more. In the second and third panels, affective information is added: larger β values correspond to models that weight affective information more. Peak performance for all models usually occurs when about half of the information is experiential.
Fig. 3. Evaluation of alternative distributional linguistic models on concrete and abstract words in the triad task. The figure shows the correlations and 95% confidence intervals for unimodal and multimodal (visual left panel, affective right panel) models using the standard word2vec model based on a 2B-token corpus and two GloVe embeddings trained on corpora of either 6B or 840B tokens.
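Unimodal linguistic performance of this kind is typically obtained by correlating the cosine similarities of rated word pairs with the human ratings. A minimal sketch follows; the pretrained gensim vectors and the toy word pairs are stand-ins, not the corpora or benchmark items used in the paper.

```python
# A minimal sketch of scoring a distributional model on human pairwise relatedness
# ratings. The gensim model name and the word pairs below are illustrative assumptions.
import gensim.downloader as api
from scipy.stats import pearsonr

model = api.load("glove-wiki-gigaword-300")

# Hypothetical (word1, word2, rating) triples standing in for a benchmark dataset.
pairs = [("car", "truck", 8.5), ("car", "banana", 1.2), ("music", "melody", 8.0)]

model_sims = [model.similarity(w1, w2) for w1, w2, _ in pairs]
human = [rating for _, _, rating in pairs]
r, _ = pearsonr(model_sims, human)
print(f"Pearson r = {r:.2f}")
```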
Fig. 4. Evaluation of alternative word association models on concrete and abstract words in the triad task. The multimodal model in the left panel includes visual information; the one in the right panel includes affective information. The length of the random walk was varied by setting α, and the maximal and minimal values of r were overall similar regardless of α.
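The word association models weight longer indirect paths in the association graph more heavily as α increases. The sketch below illustrates this idea as a truncated random-walk series over a row-normalized association matrix; additional steps used in the published association models (such as positive pointwise mutual information weighting) are omitted, and the toy graph is invented.

```python
# A minimal sketch of a random-walk association model: longer walks receive more weight
# as alpha increases. The truncated series approximates (1 - alpha) * (I - alpha*P)^(-1).
import numpy as np

def random_walk_similarity(adjacency, alpha=0.75, max_steps=5):
    """Cosine similarities between rows of a walk-weighted association matrix."""
    P = adjacency / adjacency.sum(axis=1, keepdims=True)   # row-normalized transitions
    walked = np.zeros_like(P)
    step = np.eye(P.shape[0])
    for k in range(max_steps + 1):
        walked += (1 - alpha) * (alpha ** k) * step
        step = step @ P
    unit = walked / np.linalg.norm(walked, axis=1, keepdims=True)
    return unit @ unit.T                                    # pairwise cosine similarities

# Toy 4-node association graph (cue -> response counts), for illustration only.
adjacency = np.array([[0, 5, 1, 0],
                      [4, 0, 2, 1],
                      [1, 2, 0, 6],
                      [0, 1, 5, 0]], dtype=float)
print(random_walk_similarity(adjacency, alpha=0.75).round(2))
```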
Pearson correlation and confidence intervals for correlation differences Δr between unimodal and multimodal visual models. The top part of the table shows the results for the distributional linguistic model, whereas the bottom part shows the results for the word association model
| Dataset | n | Linguistic r | CI95 | Visual r | CI95 | β | Multimodal r | CI95 | Δr | CI95 |
|---|---|---|---|---|---|---|---|---|---|---|
| MEN | 942 | .79 | [0.77, 0.82] | .66 | [0.62, 0.70] | 0.38 | .82 | [0.80, 0.84] | .03 | [0.02, 0.04] |
| MTURK‐771 | 260 | .67 | [0.59, 0.73] | .49 | [0.39, 0.58] | 0.38 | .71 | [0.64, 0.76] | .04 | [0.01, 0.08] |
| SimLex‐999 | 300 | .43 | [0.33, 0.52] | .54 | [0.45, 0.61] | 0.55 | .56 | [0.48, 0.64] | .14 | [0.07, 0.21] |
| Silberer (Sem.) | 5,799 | .73 | [0.71, 0.74] | .78 | [0.77, 0.79] | 0.53 | .82 | [0.82, 0.83] | .10 | [0.09, 0.10] |
| Silberer (Vis.) | 5,777 | .59 | [0.57, 0.61] | .74 | [0.73, 0.75] | 0.65 | .75 | [0.74, 0.76] | .16 | [0.15, 0.17] |
| Average | | .55 | | .55 | | | .64 | | .08 | |
Pearson correlation and confidence intervals for correlation differences Δr between unimodal and multimodal affective models. The top part of the table shows the results for the distributional linguistic model, whereas the bottom part shows the results for the word association model
| Dataset | n | Linguistic r | CI95 | Affective r | CI95 | β | Multimodal r | CI95 | Δr | CI95 |
|---|---|---|---|---|---|---|---|---|---|---|
| MEN | 1,981 | .80 | [0.78, 0.81] | .31 | [0.27, 0.35] | 0.45 | .80 | [0.78, 0.81] | .00 | [0.00, 0.00] |
| MTURK‐771 | 653 | .70 | [0.66, 0.74] | .26 | [0.19, 0.33] | 0.53 | .71 | [0.67, 0.75] | .01 | [0.00, 0.02] |
| SimLex‐999 | 913 | .45 | [0.39, 0.50] | .33 | [0.27, 0.39] | 0.65 | .52 | [0.47, 0.56] | .07 | [0.04, 0.10] |
| SimVerb‐3500 | 2,926 | .33 | [0.30, 0.36] | .33 | [0.30, 0.36] | 0.68 | .44 | [0.41, 0.47] | .11 | [0.08, 0.13] |
| Silberer (Sem.) | 5,428 | .74 | [0.73, 0.75] | .21 | [0.19, 0.24] | 0.33 | .74 | [0.73, 0.76] | .00 | [0.00, 0.00] |
| Silberer (Vis.) | 5,405 | .60 | [0.58, 0.61] | .16 | [0.14, 0.16] | 0.30 | .60 | [0.58, 0.61] | .00 | [0.00, 0.00] |
| Average | | .60 | | .27 | | | .63 | | .03 | |
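The affective model above represents each word by a small number of affective ratings (such as valence, arousal, and dominance) and measures similarity between these vectors. The sketch below uses invented rating values rather than published norms, and centering at the scale midpoint is only one reasonable preprocessing choice, not necessarily the paper's.

```python
# A minimal sketch of an affective model: each word is a short vector of affective
# ratings (valence, arousal, dominance). The numbers are made up, not real norms.
import numpy as np

RATING_MIDPOINT = 5.0  # midpoint of a 1-9 rating scale

affect = {
    "joy":     np.array([8.2, 5.9, 6.5]),
    "grief":   np.array([1.7, 4.8, 3.0]),
    "justice": np.array([6.4, 4.4, 5.8]),
}

def affective_similarity(w1, w2, norms):
    u = norms[w1] - RATING_MIDPOINT
    v = norms[w2] - RATING_MIDPOINT
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(affective_similarity("joy", "justice", affect))   # relatively high
print(affective_similarity("joy", "grief", affect))     # lower (opposite valence)
```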
Fig. 5. Results of multimodal models (created by combining either distributional linguistic or word association models with visual features) based on pairwise similarity ratings from the Silberer dataset. The plots show correlations between human judgments and models together with 95% confidence intervals (shaded). The two left panels show the semantic and visual judgments for all items, whereas the two right panels show performance on the subset of basic‐level items in which the word pairs belong to the same superordinate category.
Replication of Table 2 restricted to basic‐level word pairs in the Silberer dataset
| Dataset | n | Linguistic r | CI95 | Visual r | CI95 | β | Multimodal r | CI95 | Δr | CI95 |
|---|---|---|---|---|---|---|---|---|---|---|
| Basic‐Sem. | 1,086 | .46 | [0.41, 0.50] | .56 | [0.52, 0.60] | 0.55 | .61 | [0.58, 0.65] | .16 | [0.12, 0.19] |
| Basic‐Vis. | 1,086 | .35 | [0.29, 0.40] | .67 | [0.64, 0.70] | 0.70 | .68 | [0.64, 0.71] | .33 | [0.28, 0.38] |
Replication of the results reported in Table 4 restricted to abstract word pairs in the MTURK‐771, SimLex‐999, and SimVerb‐3500 datasets
| Dataset | n | Linguistic r | CI95 | Affective r | CI95 | β | Multimodal r | CI95 | Δr | CI95 |
|---|---|---|---|---|---|---|---|---|---|---|
| MTURK‐771 | 121 | .67 | [0.56, 0.76] | .31 | [0.14, 0.46] | 0.53 | .70 | [0.59, 0.78] | .02 | [−0.01, 0.06] |
| SimLex‐999 | 336 | .43 | [0.34, 0.52] | .51 | [0.42, 0.58] | 0.68 | .64 | [0.58, 0.70] | .21 | [0.14, 0.28] |
| SimVerb‐3500 | 1,466 | .28 | [0.23, 0.32] | .40 | [0.36, 0.45] | 0.73 | .47 | [0.43, 0.51] | .19 | [0.15, 0.23] |
| Average | | .46 | | .41 | | | .60 | | .14 | |
Fig. 6. Investigation of the role of affective information for abstract words comparing the full set (all pairs) with a subset of abstract words taken from the SimLex‐999 and SimVerb‐3500 datasets. Results qualitatively replicate the finding in Study 1 that adding affective information improves the performance of the distributional linguistic model but not the word association model when the abstractness of the words is considered.
Comparison between Study 2 results and previously published studies. The correlations reported are between the human ratings and, respectively, the unimodal linguistic model used in each study, the visual model used in that study, and the multimodal model that combines both.
Silberer2014 (semantic judgments)

| Study | Linguistic model | Visual model | Linguistic r | Visual r | Multimodal r |
|---|---|---|---|---|---|
| Silberer et al. (2014) | feature ratings | visual attributes (ImageNet) | .71 | .49 | .68 |
| Lazaridou et al. | word2vec | CNN features | .62 | .55 | .72 |
| De Deyne, Navarro, Collell, and Perfors | word2vec | CNN features | .73 | .78 | .82 |
| De Deyne et al. | word associations | CNN features | .84 | .78 | .87 |
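The CNN features in the comparison above are visual vectors derived from a convolutional network. A common recipe, sketched below, is to average penultimate-layer activations of a pretrained image classifier over several images of a concept; the choice of torchvision's ResNet-18 and the file names are assumptions for illustration, not the network or images used in these studies.

```python
# A minimal sketch (torchvision >= 0.13; ResNet-18 is an assumption, not the paper's CNN)
# of turning images of a concept into a single visual vector by averaging
# penultimate-layer activations.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet18_Weights.DEFAULT
cnn = models.resnet18(weights=weights)
cnn.fc = torch.nn.Identity()          # drop the classification layer -> 512-d features
cnn.eval()
preprocess = weights.transforms()     # standard resize/crop/normalize pipeline

def concept_vector(image_paths):
    """Average CNN features over a list of image files depicting one concept."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            feats.append(cnn(img).squeeze(0))
    return torch.stack(feats).mean(dim=0)

# Usage (hypothetical file names):
# hedgehog_vec = concept_vector(["hedgehog_01.jpg", "hedgehog_02.jpg"])
```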
Fig. 7. A comparison of unimodal and multimodal model performance for three different distributional linguistic models: word2vec, GloVe‐6B (6 billion words), and GloVe‐840B (840 billion words). Multimodal models reflect the Pearson correlation optimized over the mixing parameter β. The top panels show Pearson correlations and 95% confidence intervals for the visual multimodal models (Panel A) and affective multimodal models (Panel B). The bottom panels show the findings when considering a subset of concrete word pairs at the basic level (Panel C) and abstract affective words (Panel D).