Simon De Deyne, Danielle J. Navarro, Guillem Collell, Andrew Perfors.
Abstract
One of the main limitations of natural language-based approaches to meaning is that they do not incorporate multimodal representations the way humans do. In this study, we evaluate how well different kinds of models account for people's representations of both concrete and abstract concepts. The models we compare include unimodal distributional linguistic models as well as multimodal models which combine linguistic with perceptual or affective information. There are two types of linguistic models: those based on text corpora and those derived from word association data. We present two new studies and a reanalysis of a series of previous studies. The studies demonstrate that both visual and affective multimodal models better capture behavior that reflects human representations than unimodal linguistic models. The size of the multimodal advantage depends on the nature of the semantic representations involved, and it is especially pronounced for basic-level concepts that belong to the same superordinate category. Additional visual and affective features improve the accuracy of linguistic models based on text corpora more than those based on word associations; this suggests systematic qualitative differences between what information is encoded in natural language versus what information is reflected in word associations. Altogether, our work presents new evidence that multimodal information is important for capturing both abstract and concrete words and that fully representing word meaning requires more than purely linguistic information. Implications for both embodied and distributional views of semantic representation are discussed.
Keywords: Affect; Distributional semantics; Multimodal representations; Semantic networks; Visual features
Year: 2021 PMID: 33432630 PMCID: PMC7816238 DOI: 10.1111/cogs.12922
Source DB: PubMed Journal: Cogn Sci ISSN: 0364-0213
Fig. 1. Part of the WordNet hierarchy for concrete synsets that are covered by ImageNet. Synsets can occur at different hierarchical levels and are labeled by one or multiple words that can overlap with other branches in the tree; this is illustrated for the case “hedgehog.”
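For readers who want to inspect the hierarchy in Fig. 1 themselves, the sketch below shows how WordNet synsets, hypernym paths, and the most specific common hypernym (and its depth) can be retrieved with NLTK. The word choices are illustrative and are not taken from the paper's stimulus lists.

```python
# A minimal sketch (not from the paper) of inspecting the WordNet structure in Fig. 1
# with NLTK. Requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# "hedgehog" labels synsets in more than one branch of the noun hierarchy.
for synset in wn.synsets("hedgehog", pos=wn.NOUN):
    path = synset.hypernym_paths()[0]               # one root-to-synset path
    print(synset.name(), synset.min_depth())
    print("  " + " -> ".join(s.name() for s in path))

# The most specific common hypernym of two concrete concepts, and its depth.
dog = wn.synset("dog.n.01")
cat = wn.synset("cat.n.01")
common = dog.lowest_common_hypernyms(cat)[0]
print(common.name(), common.min_depth())            # carnivore.n.01
```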
Concrete triad stimuli in Experiment 1
Abstract triad stimuli in Experiment 1. Category labels refer to the most specific common hypernym in WordNet found at depth [d]
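In the triad task used with these stimuli, participants choose which two of three words are most related; a model's prediction can be scored as the pair with the highest similarity under that model. The following is a minimal sketch with made-up toy vectors, not the models or stimuli used in the study.

```python
# A minimal sketch (hypothetical vectors) of scoring a model on the triad task:
# for three words, predict the pair with the highest cosine similarity.
from itertools import combinations
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict_triad(words, vectors):
    """Return the pair of words the model judges most related."""
    pairs = list(combinations(words, 2))
    sims = [cosine(vectors[a], vectors[b]) for a, b in pairs]
    return pairs[int(np.argmax(sims))]

# Toy 3-dimensional vectors purely for illustration.
vectors = {
    "violin": np.array([0.9, 0.1, 0.0]),
    "cello":  np.array([0.8, 0.2, 0.1]),
    "hammer": np.array([0.1, 0.9, 0.3]),
}
print(predict_triad(["violin", "cello", "hammer"], vectors))  # ('violin', 'cello')
```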
Pearson correlations and confidence intervals for unimodal and multimodal models. The top panel shows performance when the linguistic component is the distributional linguistic model; in the middle panel the linguistic component is the word association baseline. The bottom panel corresponds to a purely experiential model in which the affective model takes the place of the linguistic component and is added for completeness. In each panel, the unimodal columns show the performance of that linguistic model as well as of the experiential models on either the concrete or the abstract words. The best-fitting multimodal models, which combine the linguistic and experiential components, were found by optimizing the correlation over the mixing parameter β and are shown in the multimodal column. The improvement due to adding experiential information is shown in column Δr.
| Dataset | n | Linguistic r | CI95 | Experiential model | Experiential r | CI95 | β | Multimodal r | CI95 | Δr | CI95 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Concrete | 300 | .64 | [0.57, 0.70] | Visual | .67 | [0.60, 0.73] | .48 | .75 | [0.70, 0.80] | .12 | [0.07, 0.17] |
| Concrete | 300 | .64 | [0.57, 0.70] | Affect | .21 | [0.10, 0.32] | .50 | .68 | [0.62, 0.74] | .04 | [0.02, 0.08] |
| Abstract | 300 | .62 | [0.54, 0.68] | Affect | .51 | [0.43, 0.59] | .58 | .74 | [0.69, 0.79] | .13 | [0.08, 0.19] |
Note that the confidence intervals for Δr are based on testing significant differences between dependent overlapping correlations following Zou (2007). This approach increases the power to detect an effect compared to Fisher's r-to-z procedure, which assumes independence.
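The multimodal columns above combine a linguistic and an experiential similarity via the mixing parameter β, chosen to maximize the Pearson correlation with human judgments. The sketch below assumes a simple convex combination of pairwise similarity scores and a grid search over β; the paper's exact combination scheme may differ, and the toy data are invented.

```python
# A minimal sketch of fitting the mixing parameter beta, assuming the multimodal
# similarity is a convex combination of linguistic and experiential pair similarities.
import numpy as np
from scipy.stats import pearsonr

def best_beta(sim_ling, sim_exp, human, grid=np.linspace(0, 1, 101)):
    """Grid-search beta maximizing the Pearson correlation with human judgments."""
    best = (0.0, -1.0)
    for beta in grid:
        mixed = (1 - beta) * sim_ling + beta * sim_exp
        r, _ = pearsonr(mixed, human)
        if r > best[1]:
            best = (float(beta), float(r))
    return best  # (beta, r)

# Toy arrays of pairwise similarities and human ratings, for illustration only.
rng = np.random.default_rng(0)
sim_ling = rng.random(300)
sim_exp = 0.5 * sim_ling + 0.5 * rng.random(300)
human = 0.6 * sim_ling + 0.4 * sim_exp + 0.1 * rng.random(300)
print(best_beta(sim_ling, sim_exp, human))
```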
Fig. 2. The effect of adding visual or affective experiential information to predict triadic preferences for concrete (first and second panels) and abstract (third panel) word pairs. Each panel shows the unimodal distributional linguistic and word association correlations on the left side of the x‐axis and the unimodal experiential (affective or visual features) correlations on the right side. Intermediate values on the x‐axis indicate multimodal models. In the first panel, visual information is added: larger β values correspond to models that weight visual feature information more. In the second and third panels, affective information is added: larger β values correspond to models that weight affective information more. Peak performance for all models usually occurs when about half of the information is experiential.
Fig. 3. Evaluation of alternative distributional linguistic models on concrete and abstract words in the triad task. The figure shows the correlations and 95% confidence intervals for unimodal and multimodal (visual left panel, affective right panel) models using the standard word2vec model based on a 2B-token corpus and two GloVe embeddings trained on corpora of either 6B or 840B tokens.
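Unimodal linguistic performance of this kind is typically obtained by correlating the cosine similarities of rated word pairs with the human ratings. A minimal sketch follows; the pretrained gensim vectors and the toy word pairs are stand-ins, not the corpora or benchmark items used in the paper.

```python
# A minimal sketch of scoring a distributional model on human pairwise relatedness
# ratings. The gensim model name and the word pairs below are illustrative assumptions.
import gensim.downloader as api
from scipy.stats import pearsonr

model = api.load("glove-wiki-gigaword-300")

# Hypothetical (word1, word2, rating) triples standing in for a benchmark dataset.
pairs = [("car", "truck", 8.5), ("car", "banana", 1.2), ("music", "melody", 8.0)]

model_sims = [model.similarity(w1, w2) for w1, w2, _ in pairs]
human = [rating for _, _, rating in pairs]
r, _ = pearsonr(model_sims, human)
print(f"Pearson r = {r:.2f}")
```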
Fig. 4. Evaluation of alternative word association models on concrete and abstract words in the triad task. The multimodal model in the left panel includes visual information; the one in the right panel includes affective information. The length of the random walk was varied by setting α, and the maximal and minimal values of r were overall similar regardless of α.
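The word association models weight longer indirect paths in the association graph more heavily as α increases. The sketch below illustrates this idea as a truncated random-walk series over a row-normalized association matrix; additional steps used in the published association models (such as positive pointwise mutual information weighting) are omitted, and the toy graph is invented.

```python
# A minimal sketch of a random-walk association model: longer walks receive more weight
# as alpha increases. The truncated series approximates (1 - alpha) * (I - alpha*P)^(-1).
import numpy as np

def random_walk_similarity(adjacency, alpha=0.75, max_steps=5):
    """Cosine similarities between rows of a walk-weighted association matrix."""
    P = adjacency / adjacency.sum(axis=1, keepdims=True)   # row-normalized transitions
    walked = np.zeros_like(P)
    step = np.eye(P.shape[0])
    for k in range(max_steps + 1):
        walked += (1 - alpha) * (alpha ** k) * step
        step = step @ P
    unit = walked / np.linalg.norm(walked, axis=1, keepdims=True)
    return unit @ unit.T                                    # pairwise cosine similarities

# Toy 4-node association graph (cue -> response counts), for illustration only.
adjacency = np.array([[0, 5, 1, 0],
                      [4, 0, 2, 1],
                      [1, 2, 0, 6],
                      [0, 1, 5, 0]], dtype=float)
print(random_walk_similarity(adjacency, alpha=0.75).round(2))
```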
Pearson correlation and confidence intervals for correlation differences Δr between unimodal and multimodal visual models. The top part of the table shows the results for the distributional linguistic model, whereas the bottom part shows the results for the word association model
| Dataset | n | Linguistic r | CI95 | Visual r | CI95 | β | Multimodal r | CI95 | Δr | CI95 |
|---|---|---|---|---|---|---|---|---|---|---|
| MEN | 942 | .79 | [0.77, 0.82] | .66 | [0.62, 0.70] | 0.38 | .82 | [0.80, 0.84] | .03 | [0.02, 0.04] |
| MTURK‐771 | 260 | .67 | [0.59, 0.73] | .49 | [0.39, 0.58] | 0.38 | .71 | [0.64, 0.76] | .04 | [0.01, 0.08] |
| SimLex‐999 | 300 | .43 | [0.33, 0.52] | .54 | [0.45, 0.61] | 0.55 | .56 | [0.48, 0.64] | .14 | [0.07, 0.21] |
| Silberer (Sem.) | 5,799 | .73 | [0.71, 0.74] | .78 | [0.77, 0.79] | 0.53 | .82 | [0.82, 0.83] | .10 | [0.09, 0.10] |
| Silberer (Vis.) | 5,777 | .59 | [0.57, 0.61] | .74 | [0.73, 0.75] | 0.65 | .75 | [0.74, 0.76] | .16 | [0.15, 0.17] |
| Average | | .55 | | .55 | | | .64 | | .08 | |
Pearson correlation and confidence intervals for correlation differences Δr between unimodal and multimodal affective models. The top part of the table shows the results for the distributional linguistic model, whereas the bottom part shows the results for the word association model
| Dataset | n | Linguistic r | CI95 | Affective r | CI95 | β | Multimodal r | CI95 | Δr | CI95 |
|---|---|---|---|---|---|---|---|---|---|---|
| MEN | 1,981 | .80 | [0.78, 0.81] | .31 | [0.27, 0.35] | 0.45 | .80 | [0.78, 0.81] | .00 | [0.00, 0.00] |
| MTURK‐771 | 653 | .70 | [0.66, 0.74] | .26 | [0.19, 0.33] | 0.53 | .71 | [0.67, 0.75] | .01 | [0.00, 0.02] |
| SimLex‐999 | 913 | .45 | [0.39, 0.50] | .33 | [0.27, 0.39] | 0.65 | .52 | [0.47, 0.56] | .07 | [0.04, 0.10] |
| SimVerb‐3500 | 2,926 | .33 | [0.30, 0.36] | .33 | [0.30, 0.36] | 0.68 | .44 | [0.41, 0.47] | .11 | [0.08, 0.13] |
| Silberer (Sem.) | 5,428 | .74 | [0.73, 0.75] | .21 | [0.19, 0.24] | 0.33 | .74 | [0.73, 0.76] | .00 | [0.00, 0.00] |
| Silberer (Vis.) | 5,405 | .60 | [0.58, 0.61] | .16 | [0.14, 0.16] | 0.30 | .60 | [0.58, 0.61] | .00 | [0.00, 0.00] |
| Average | | .60 | | .27 | | | .63 | | .03 | |
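The affective model above represents each word by a small number of affective ratings (such as valence, arousal, and dominance) and measures similarity between these vectors. The sketch below uses invented rating values rather than published norms, and centering at the scale midpoint is only one reasonable preprocessing choice, not necessarily the paper's.

```python
# A minimal sketch of an affective model: each word is a short vector of affective
# ratings (valence, arousal, dominance). The numbers are made up, not real norms.
import numpy as np

RATING_MIDPOINT = 5.0  # midpoint of a 1-9 rating scale

affect = {
    "joy":     np.array([8.2, 5.9, 6.5]),
    "grief":   np.array([1.7, 4.8, 3.0]),
    "justice": np.array([6.4, 4.4, 5.8]),
}

def affective_similarity(w1, w2, norms):
    u = norms[w1] - RATING_MIDPOINT
    v = norms[w2] - RATING_MIDPOINT
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(affective_similarity("joy", "justice", affect))   # relatively high
print(affective_similarity("joy", "grief", affect))     # lower (opposite valence)
```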
Fig. 5. Results of multimodal models (created by combining either distributional linguistic or word association models with visual features) based on pairwise similarity ratings from the Silberer dataset. The plots show correlations between human judgments and models together with 95% confidence intervals (shaded). The two left panels show the semantic and visual judgments for all items, whereas the two right panels show performance on the subset of basic‐level items in which the word pairs belong to the same superordinate category.
Replication of Table 2 restricted to basic‐level word pairs in the Silberer dataset
| Dataset | n | Linguistic r | CI95 | Visual r | CI95 | β | Multimodal r | CI95 | Δr | CI95 |
|---|---|---|---|---|---|---|---|---|---|---|
| Basic‐Sem. | 1,086 | .46 | [0.41, 0.50] | .56 | [0.52, 0.60] | 0.55 | .61 | [0.58, 0.65] | .16 | [0.12, 0.19] |
| Basic‐Vis. | 1,086 | .35 | [0.29, 0.40] | .67 | [0.64, 0.70] | 0.70 | .68 | [0.64, 0.71] | .33 | [0.28, 0.38] |
Replication of the results reported in Table 4 restricted to abstract word pairs in the MTURK‐771, SimLex‐999, and SimVerb‐3500 datasets
| Dataset | n | Linguistic r | CI95 | Affective r | CI95 | β | Multimodal r | CI95 | Δr | CI95 |
|---|---|---|---|---|---|---|---|---|---|---|
| MTURK‐771 | 121 | .67 | [0.56, 0.76] | .31 | [0.14, 0.46] | 0.53 | .70 | [0.59, 0.78] | .02 | [−0.01, 0.06] |
| SimLex‐999 | 336 | .43 | [0.34, 0.52] | .51 | [0.42, 0.58] | 0.68 | .64 | [0.58, 0.70] | .21 | [0.14, 0.28] |
| SimVerb‐3500 | 1,466 | .28 | [0.23, 0.32] | .40 | [0.36, 0.45] | 0.73 | .47 | [0.43, 0.51] | .19 | [0.15, 0.23] |
| Average | | .46 | | .41 | | | .60 | | .14 | |
Fig. 6. Investigation of the role of affective information for abstract words comparing the full set (all pairs) with a subset of abstract words taken from the SimLex‐999 and SimVerb‐3500 datasets. Results qualitatively replicate the finding in Study 1 that adding affective information improves the performance of the distributional linguistic model but not the word association model when the abstractness of the words is considered.
Comparison between Study 2 results and previously published studies. The correlations reported are between the human ratings and, respectively, the unimodal linguistic model used in each study, the visual model used in that study, and the multimodal model that combines both.
Silberer2014 (semantic judgments)

| Study | Linguistic model | Visual model | Linguistic r | Visual r | Multimodal r |
|---|---|---|---|---|---|
| Silberer et al. (2014) | feature ratings | visual attributes (ImageNet) | .71 | .49 | .68 |
| Lazaridou et al. | word2vec | CNN features | .62 | .55 | .72 |
| De Deyne, Navarro, Collell, and Perfors | word2vec | CNN features | .73 | .78 | .82 |
| De Deyne et al. | word associations | CNN features | .84 | .78 | .87 |
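The CNN features in the comparison above are visual vectors derived from a convolutional network. A common recipe, sketched below, is to average penultimate-layer activations of a pretrained image classifier over several images of a concept; the choice of torchvision's ResNet-18 and the file names are assumptions for illustration, not the network or images used in these studies.

```python
# A minimal sketch (torchvision >= 0.13; ResNet-18 is an assumption, not the paper's CNN)
# of turning images of a concept into a single visual vector by averaging
# penultimate-layer activations.
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet18_Weights.DEFAULT
cnn = models.resnet18(weights=weights)
cnn.fc = torch.nn.Identity()          # drop the classification layer -> 512-d features
cnn.eval()
preprocess = weights.transforms()     # standard resize/crop/normalize pipeline

def concept_vector(image_paths):
    """Average CNN features over a list of image files depicting one concept."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            feats.append(cnn(img).squeeze(0))
    return torch.stack(feats).mean(dim=0)

# Usage (hypothetical file names):
# hedgehog_vec = concept_vector(["hedgehog_01.jpg", "hedgehog_02.jpg"])
```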
Fig. 7. A comparison of unimodal and multimodal model performance for three different distributional linguistic models: word2vec, GloVe‐6B (6 billion words), and GloVe‐840B (840 billion words). Multimodal models reflect the Pearson correlation optimized over the mixing parameter β. The top panels show Pearson correlations and 95% confidence intervals for the visual multimodal models (Panel A) and affective multimodal models (Panel B). The bottom panels show the findings when considering a subset of concrete word pairs at the basic level (Panel C) and abstract affective words (Panel D).