Materials with complex appearances, like textiles and foodstuffs, pose challenges for conventional theories of vision. But recent advances in unsupervised deep learning provide a framework for explaining how we learn to see them. We suggest that perception does not involve estimating physical quantities like reflectance or lighting. Instead, representations emerge from learning to encode and predict the visual input as efficiently and accurately as possible. Neural networks can be trained to compress natural images or to predict frames in movies without 'ground truth' data about the outside world. Yet, to succeed, such systems may automatically discover how to disentangle distal causal factors. Such 'statistical appearance models' potentially provide a coherent explanation of both failures and successes in perception.
Materials such as tweed, leather or scrambled eggs have richly detailed visual appearances (Figure 1a). When we view such materials, we enjoy a vivid impression of their characteristics, such as how they would feel if touched [1, 2, 3,4,5]. Yet, due to their physical complexity, they pose profound challenges for traditional ‘inverse optics’ theories of perception [6, 7, 8,9,77]. Most theories assume the brain’s goal is to estimate physical quantities, like surface reflectance, orientation or depths [10, 11, 12]. Yet when we perceive complex materials, what exactly is the brain ‘estimating’? Many visual properties—such as how faded denim appears, the ripeness of a pear, or the gracefulness of a ballet dancer—are hard to define in physical terms (cf. ‘tertiary properties’, [13,14]; or ‘affordances’, [15]). Moreover, we cannot be born knowing all these properties and how to infer them—denim and ballet dancers did not exist during evolution. Instead, their appearance characteristics must somehow be learned. For properties like these, we must not only learn how to estimate distal properties from image data, but also what to estimate in the first place.
Figure 1
Learning to see stuff.
(a) Substances such as tweed, leather, and scrambled eggs evoke rich material impressions. (b) Physical parameters (here, azimuth and elevation angle) determine the retinal image (‘forward optics’). Neighbouring physical parameters can give rise to wildly different images (the tangled pink grid), and most possible images look like meaningless noise (cyan dots). Unsupervised learning can discover ‘statistical appearance models’, comprising latent variables that efficiently capture the variation among natural images. (c) Deep neural networks can learn powerful latent codes capturing natural image variations. After training to encode 70 000 real human faces from the FFHQ dataset ([75]; https://github.com/NVlabs/ffhq-dataset; images are public domain as defined under creative commons CC0 1.0 license), a network was able to generate completely novel face images such as the nine shown, which do not correspond to any existing person (generated by Jordan Suchow using the PixelVAE network described in Ref. [29]).
This leads to a fundamental question. How do we learn to see the outside world? It cannot be primarily through supervised learning because we never get detailed information about the true state of the world. Most points in the visual field are beyond reach, and we cannot feel, taste or smell the colour of a surface. Indeed, all sensory signals are highly ambiguous, so no one sensory modality can provide the ground truth for the others. Although motor actions allow us to probe the world to learn about how it behaves, we can only ever detect the effects of our actions via the senses. Thus, learning how to see the outside world must somehow proceed without explicit labelled training data (or at best exceedingly sparse data; see also [16] for related arguments).
Together, these considerations indicate we need an alternative formulation of vision that goes beyond ‘inverse optics’.

We suggest that perception of complex material and object properties does not arise primarily through densely supervised learning, nor indeed through estimating predefined physical quantities. Rather, perceptual representations emerge through learning to encode and predict the visual input as accurately and efficiently as possible. This may seem like a paradoxical claim, yet we propose that the best way to learn how to infer the distal stimulus (i.e. properties of the outside world) is to get really good at describing the proximal stimulus. Recent advances in unsupervised deep learning (see Box 1: The Modern Deep Learning Framework) provide a powerful framework for implementing and testing this conjecture.

Box 1
The Modern Deep Learning Framework

Deep learning is machine learning using deep neural networks. Neural networks are computer models consisting of many interconnected neuron-like units, usually arranged in processing stages or layers. Each unit combines and non-linearly transforms its input signals to produce a numerical output. Deep neural networks (DNNs) consist of multiple layers, allowing a series of intermediate representations between the network’s input and output. The transition from shallow to deeper networks has enabled ground-breaking progress in simulating human-like perceptual [43,44], cognitive [45,46,47] and linguistic [48,49] abilities, and provides a promising modelling approach in perceptual and cognitive neuroscience [50,51].

Like brains, neural networks need to learn how to perform the tasks that are required of them. Knowledge is embodied in the weights with which each unit combines its inputs, which are initially random. The weights are then incrementally updated via a learning algorithm, such as backpropagation, which adjusts the network’s parameters to fulfil a specific objective function, typically minimizing error on a particular task.

In supervised learning, the network’s weights are adjusted to bring the outputs for the training inputs (e.g. photos of objects) closer to desired outputs (e.g. corresponding object names). Thus, supervised learning involves labelled training data (e.g., [79]). After training, the network can generalise by assigning appropriate outputs to novel inputs (e.g. recognise an object in a photo it has not seen before).

In unsupervised learning, the network learns to capture high-order statistics of its training inputs, rather than return specific desired outputs. For example, an autoencoder network learns to compress high-dimensional input data within a lower-dimensional ‘latent code’ layer, in such a way that allows it to reconstruct the original input with minimal distortion (Figure 2). Unsupervised learning signals are generally richer than supervised ones; an image autoencoder derives a training signal for every pixel of its attempted reconstruction, whereas a supervised object-recognition network might receive only one object label per image. Unsupervised models therefore attempt to learn all regularities in their data, not only those relevant for a predefined task. This makes unsupervised learning a good candidate method through which to learn visual representations upon which many natural tasks can be performed.
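As a rough illustration of the description above, the following sketch shows a single neuron-like unit (a weighted sum followed by a non-linearity) whose weights are nudged step by step to reduce a squared-error objective. It is a minimal sketch and not taken from any cited model: the choice of `tanh`, the learning rate, the starting weights and the function names are all assumptions made for illustration.

```python
import math

def unit_output(weights, bias, inputs):
    """One neuron-like unit: weighted sum of inputs, then a non-linearity (tanh, an assumption)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(total)

def squared_error(weights, bias, inputs, target):
    """Objective function: squared difference between actual and desired output."""
    return (unit_output(weights, bias, inputs) - target) ** 2

def gradient_step(weights, bias, inputs, target, lr=0.1):
    """One incremental weight update that moves the output towards the target."""
    out = unit_output(weights, bias, inputs)
    # Derivative of (out - target)^2 with respect to the weighted sum, using d tanh = 1 - tanh^2.
    delta = 2.0 * (out - target) * (1.0 - out ** 2)
    new_weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * delta
    return new_weights, new_bias

# Hypothetical starting weights, input pattern and desired output.
weights, bias = [0.2, -0.4], 0.0
inputs, target = [1.0, 0.5], 0.8

before = squared_error(weights, bias, inputs, target)
for _ in range(50):
    weights, bias = gradient_step(weights, bias, inputs, target)
after = squared_error(weights, bias, inputs, target)
print(before, after)
```

In a deep network the same kind of update is applied to every weight in every layer, with backpropagation supplying each unit's share of the error.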
Figure 2
Unsupervised image compression can discover natural material types.
Top right: schematic of an autoencoder network trained on images of natural textures. Images are passed through four convolutional layers with successively fewer units, before being expanded back to the original dimensionality. The learning objective is to minimise the pixelwise difference between original and reconstructed images. Bottom left: by applying the dimensionality reduction method tSNE [76] to 3000 images depicting fur, gravel, or wool, we see that these categories are highly intermixed in image space. The tSNE algorithm embeds high-dimensional data into two dimensions for visualisation, while preserving local distances between nearby points as faithfully as possible. Bottom right: when the same algorithm is applied to the representations of the images within the trained autoencoder’s latent code, strong clusters emerge corresponding to the natural material types.
Statistical appearance models
In contrast to inverse optics, we suggest that through the vast visual diet of our infancy, we learn to parse visual experience in some less physically principled but more ecologically feasible way. In support of this, psychophysical data suggest material perception is often best explained by recourse to features in images, rather than ground-truth physical properties [17, 18, 19, 78]. For example, in gloss perception, variations of shape and lighting cause surfaces with identical reflectances to appear differently glossy (Marlow et al., 2012; [18]). Such illusions are difficult to explain if the goal is to recover physical surface reflectance. Yet, image features like the size, contrast and sharpness of highlights can well predict the ‘erroneous’ gloss judgments. This suggests that visual processes seek to capture and parameterise statistical variations in the proximal image data, rather than estimate distal scene parameters per se.

Specifically, rather than learning mappings between image quantities (‘cues’) and physical quantities, we learn to represent the dimensions of variation within and among natural images, which in turn arise from the systematic effects that distal properties have on the image. For example, a salient difference between images of surfaces with different reflectance properties is that the size, contrast and sharpness of highlights tend to vary. Thus, the visual system learns to separate surfaces with low-contrast bright blotches, from those with high-contrast blotches. All other things being equal, this is a valid way of distinguishing low-gloss from high-gloss materials. Importantly, however, these systematic variations in highlights can be discovered just by observing images, without knowing a priori that there is a distal factor (specular reflectance) that is responsible for the variations. The discovered dimensions of variation might sometimes roughly align with such physical factors, but may also combine and conflate several physical parameters, leading to what seem like ‘illusions’ from an inverse optics perspective. We call internal representations of the ways images vary statistical appearance models [4,5]. We suggest such internal models provide an efficient and robust representation, on the basis of which many different estimation tasks can be performed.

This concept of statistical appearance models is somewhat abstract. How, in practice, can the brain learn the ‘natural degrees of variation’ between images? Deep learning provides a rigorous means to implement this idea in image-computable form, and to compare such models to human judgments. To appreciate why, it is useful to consider the statistical distribution of natural images.
Learning about real-world images…
Representing and distinguishing between the images we are likely to experience poses a challenging statistical problem (Figure 1b). To be efficient, visual representations should span the tiny subspace occupied by real-world images (capturing all their possible variations) but not too much more.

Consider a visual ‘world’ of 100 × 100 pixel colour images (typical for website thumbnails). We can usually recognise the content of such images, yet there are 30 000 dimensions of possible variation (one for each pixel/channel). Importantly, however, only a tiny proportion of the images in this space represent plausible scenes and objects from the real world. This is because the physical and optical generative processes that create natural images give rise to statistical dependencies across pixels. When three-dimensional objects are illuminated and projected onto the retina they yield images with high-order correlations between pixels.

Because most images in the space are highly unlikely as real images, brains need not encode them in a way that allows us to easily differentiate among them. Indeed, almost all possible images look like near-indistinguishable random noise to human observers. This means that the brain can use a lower-dimensional ‘latent code’ to represent real images more efficiently.

There is a long history of posing sensory processing in terms of efficient coding theory. Since Attneave [20] and Barlow [21] there has been a prominent idea that neural response properties are determined by the goal of efficiently encoding their input data [22, 23, 24]. This has been quite successful at explaining aspects of low-level vision, but until very recently it has not been a feasible approach to higher-level perception of objects and materials.
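To make the size of the image space in the 100 × 100 pixel example concrete, the short calculation below reproduces the 30 000 dimensions quoted above and estimates how many distinct images the space contains. The figure of 256 intensity levels per channel (8-bit colour) is an assumption for illustration and is not stated in the text.

```python
import math

# Dimensions of the toy image 'world' described in the text.
width, height, channels = 100, 100, 3
n_dims = width * height * channels       # one dimension per pixel/channel

# Assumption (not from the text): 8-bit colour, i.e. 256 levels per channel.
levels = 256
log10_n_images = n_dims * math.log10(levels)

print(n_dims)                  # number of dimensions of possible variation
print(round(log10_n_images))   # the space holds roughly 10 ** this many distinct images
```

Under that assumption the space contains on the order of 10 to the power 72 247 images, which gives a sense of how small a fraction of it real-world images could ever occupy.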
… discovers latent variables
The key insight that allows us to bridge efficient coding and high-level vision is as follows. Natural images derive their structure from all the generative processes of the natural world: everything from the laws of physics (perspective projection and specular reflection) to the fact that faces have two eyes and bicycles two wheels. The best way to get really efficient and accurate at representing the set of natural images is to discover latent variables that structure those images, expressed not in terms of physical laws, but in terms of statistical relationships between elements in the image.

It is more efficient to represent images in terms of latent variables because they describe variations in a much more compact code. For example, viewing an object from different directions generates widely varying images (see Figure 1b). But all these images occupy a 2D manifold, as all possible variations among them can be summarised with just two parameters (uniquely specifying the viewing angle), given a particular object and a fixed viewing distance. A compact representation that specifies relationships between these images in terms of two numbers has learnt something important about the outside world, even if those two latent variables do not correspond exactly to azimuth and elevation.

Thus, discovering latent variables does not necessarily require densely supervised learning, where the visual system is taught explicit mappings between image cues and physical properties. It can also be achieved through unsupervised learning, in which a system discovers regularities in its input data by itself. Through an objective function that seeks to capture the variations in proximal image data as well as possible, we may end up with internal representations that are well suited for describing the distal scene factors that created those images. Although the latent variables discovered by unsupervised learning may not correspond perfectly to the true physical factors of the world, they may provide the basis for the perceptual dimensions that emerge when observers are asked to perform specific tasks (e.g. gloss judgements). Learning such representations requires the inference powers of deep learning, as well as large amounts of training data.
Case studies in unsupervised visual learning
To acquire statistical appearance models, we need a learning framework that knows nothing about the outside world but gets to observe (potentially many) samples drawn from it. Here we highlight a few promising implementations of unsupervised learning in deep neural networks (although, see also Box 2: Caveats and Open Challenges).

Box 2
Caveats and Open Challenges

We have argued that unsupervised learning is an important part of how our rich perceptual impressions emerge, but it is likely not the full story of how we see.

Not all visual competences are learned within a single lifetime. Humans can discriminate between simple visual patterns at birth [52, 53, 54] and perhaps even before [55], and fundamental elements of spatial vision, such as the ability to segment objects in depth, may be innate [56]. Adult visual abilities combine those ‘baked in’ by evolution with those learned from experience. Even before birth, spontaneous neural activity may help structure our visual systems via unsupervised learning rules [57,58]. Functional specialisation represents another challenge for theories based purely on learning [59].

We have concentrated on learning through passive visual observation, but there is evidence that active exploration facilitates visual development [42,60, 61, 62, 63]. However, many of our visual abilities are developed before motor control allows us to make precise modifications to the world. Moreover, if vision could not be learnt without action, then congenital tetraplegics, who can barely alter the world through actions during development, would have devastating perceptual deficits, whereas their deficits are actually mild [64]. This suggests that while motor control refines our visual abilities, it is not a sine qua non for seeing the outside world.

Reinforcement is another learning objective that does not require access to ground-truth-labelled data. During training, models learn to output actions (e.g. movements) that yield rewards. This approach provides sparser training signals than objectives like reconstruction or prediction, since rewards are relatively scarce, yet can still lead to rich perceptual representations. Recent successes, for example in the domain of video game playing, point to its potential power [45,65]. Evolution may also be thought of as a type of reinforcement learning.

Concerns are often raised about the power of neural networks as explanatory models [66, 67, 68]. Yet, like animal models of psychiatric disorders, they provide a useful experimental platform. Neural networks should not be thought of as black boxes, as researchers have been able to discover much about what is represented within the latent codes of trained networks (e.g. [32,35]), and use them to make quantitative predictions for perception of novel stimuli (e.g. [33,69]). Image-computable models can be compared to biological visual systems at many levels of abstraction, from predicting neural activity to behaviour. A good model should exhibit detailed patterns of behaviour (e.g. errors, response times, and sensitivity to specific stimulus manipulations) similar to those found in humans [51,70,71,80]. However, behavioural similarity is necessary but not sufficient to show that a model does visual tasks in the same way brains do. Further refining methods for interrogating, distilling and interpreting the computational strategies learned by networks is an ongoing challenge [72, 73, 74].
Data compression
One potentially important objective is to encode images as compactly as possible. Autoencoders [25] are feedforward networks that reconstruct inputs after compressing them via a ‘bottleneck’ consisting of many fewer units than their inputs (Figure 2). Because of the bottleneck, they must learn a low-dimensional representation from which the originals can still be accurately reconstructed. As a result, they tend to discover latent variables that are good at capturing complex statistical variations across images. This can allow them to disentangle distinct causal contributions to observed data. In Figure 2 we show a ‘toy’ example of an autoencoder learning to separate material classes without explicit labels.
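The bottleneck idea can be sketched in a few lines. The example below is a deliberately tiny linear autoencoder, not the convolutional network of Figure 2: three-pixel ‘images’ are generated from one hidden factor `hidden_t`, squeezed through a single bottleneck unit, and reconstructed. The data, starting weights, learning rate and function names are all hypothetical choices made for illustration.

```python
# Toy data: 3-pixel "images" generated from ONE hidden factor t (never shown to the network).
direction = (2 / 3, 2 / 3, 1 / 3)
hidden_t = [i / 5 - 1 for i in range(11)]              # -1.0 ... 1.0
images = [[t * d for d in direction] for t in hidden_t]

def encode(w, x):
    """Compress a 3-pixel image into a single latent number (the bottleneck)."""
    return sum(wj * xj for wj, xj in zip(w, x))

def decode(v, z):
    """Expand the latent number back into a 3-pixel reconstruction."""
    return [vk * z for vk in v]

def mean_loss(w, v, data):
    """Mean pixelwise squared difference between originals and reconstructions."""
    total = 0.0
    for x in data:
        x_hat = decode(v, encode(w, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)

def train(w, v, data, lr=0.1, steps=2000):
    """Full-batch gradient descent on the reconstruction error."""
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_v = [0.0] * len(v)
        for x in data:
            z = encode(w, x)
            err = [vk * z - xk for vk, xk in zip(v, x)]
            back = sum(e * vk for e, vk in zip(err, v))   # error passed back to the bottleneck
            for k in range(len(v)):
                grad_v[k] += 2 * err[k] * z / len(data)
            for j in range(len(w)):
                grad_w[j] += 2 * back * x[j] / len(data)
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        v = [vk - lr * g for vk, g in zip(v, grad_v)]
    return w, v

w0, v0 = [0.1, -0.2, 0.3], [0.3, 0.1, -0.1]              # arbitrary starting weights
initial_loss = mean_loss(w0, v0, images)
w, v = train(w0, v0, images)
final_loss = mean_loss(w, v, images)
latent = [encode(w, x) for x in images]                  # learned one-number code per image
print(initial_loss, final_loss)
```

The network is never told the value of the hidden factor, yet after training its latent code varies monotonically with it. This is a very small-scale analogue of a latent variable lining up with a distal cause.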
Prediction in space
A closely related objective involves not reconstructing inputs pixel-for-pixel, but predicting pixels from their local neighbourhoods. Autoregressive networks like PixelCNN [26,27] and PixelVAE [28] learn a high-order statistical representation of the training set in their latent codes. They are exceptionally good at generating novel images that emulate the structure of natural images. For example, a network trained on thousands of portrait photos synthesises completely new human faces that are close to photographic quality, some samples of which are shown in Figure 1c [29]. Importantly, the latent space representations are systematically organised, such that similar latent values yield similar faces, gradually changing appearance from one identity to another via physically plausible variations (e.g. the nose gradually widens and eyebrows thicken, accurately rendered in the image). The richness of such generative models suggests they are well suited to representing complex natural materials. It would be intriguing to investigate how the latent representations in such networks relate to human perceptual judgments of feature appearance and stimulus similarity.
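The principle of predicting a pixel from its neighbours can be illustrated with a far simpler model than PixelCNN or PixelVAE. The sketch below fits a single least-squares weight that predicts each value in a one-dimensional ‘image row’ from the value to its left, and compares a structured row with a noise row. The sinusoidal row is only a stand-in for natural image structure, and the function names and row lengths are assumptions.

```python
import math
import random

def fit_neighbour_predictor(row):
    """Least-squares weight a such that row[i] is approximated by a * row[i - 1]."""
    prev, curr = row[:-1], row[1:]
    return sum(p * c for p, c in zip(prev, curr)) / sum(p * p for p in prev)

def prediction_error(row, a):
    """Mean squared error of predicting each pixel from its left neighbour."""
    prev, curr = row[:-1], row[1:]
    return sum((c - a * p) ** 2 for p, c in zip(prev, curr)) / len(curr)

# Structured row: a slowly varying sinusoid (an assumed stand-in for natural structure).
smooth_row = [math.sin(2 * math.pi * i / 50) for i in range(200)]
# Unstructured row: independent noise over the same value range (fixed seed for repeatability).
rng = random.Random(0)
noise_row = [rng.uniform(-1, 1) for _ in range(200)]

errors = {}
for name, row in (("smooth", smooth_row), ("noise", noise_row)):
    a = fit_neighbour_predictor(row)
    errors[name] = prediction_error(row, a)
    print(name, round(a, 3), round(errors[name], 4))
```

The structured row is predicted far better than the noise row, because neighbouring values carry information about one another. Autoregressive networks exploit the same kind of dependency, but across large two-dimensional neighbourhoods and with non-linear predictors.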
Prediction in time
Temporal prediction may be fundamental to how brains learn and perceive. Predictive coding theories propose that an internal generative model creates predictions of future sensory signals, and then differences between the predictions and subsequent sensory feedback (prediction error signals) update and refine the model [30,31]. A recurrent deep neural network trained by predictive error coding, PredNet [32,33], learnt to predict (i.e. synthesise) future frames in videos of natural environments (Figure 3). PredNet consists of a hierarchy of stages with bidirectional connections, allowing more abstract representations of the movie content to be inferred in deeper layers, and influence predictions at the local pixel level. The network uses long short-term memory (LSTM) units to keep track of long-range temporal dependencies [34]. Intriguingly, PredNet spontaneously discovered, in its deeper layers, higher-level properties of the objects depicted in videos, such as facial identity and pose [32], and its individual units reproduced certain temporal dynamics of primate visual neurons [33]. Thus, to get good at representing the proximal stimulus unfolding over time, the networks tend to infer distal causes. For example, to predict the un-occlusion of previously invisible shape features as an object rotates, or the sudden expansion of specular highlights as they rush across a moving surface, may require deep understanding about how the world works.
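The loop of predicting, comparing and updating described above can be sketched with a toy world that is much simpler than the videos PredNet was trained on. Here a two-dimensional point rotates by a fixed angle per frame, and a linear internal model `W` is updated only from its own prediction errors. The rotating-point world, the 10 degree step, the learning rate and the delta-rule update are illustrative assumptions and are not the PredNet architecture.

```python
import math

ANGLE = math.radians(10)   # hidden 'distal' cause: rotation per frame (hypothetical value)

def next_frame(p):
    """The world: rotate a 2-D point by ANGLE. The learner never sees this function."""
    c, s = math.cos(ANGLE), math.sin(ANGLE)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def predict(W, p):
    """Internal model: linear prediction of the next frame from the current one."""
    return (W[0][0] * p[0] + W[0][1] * p[1], W[1][0] * p[0] + W[1][1] * p[1])

W = [[0.0, 0.0], [0.0, 0.0]]   # the model initially knows nothing
lr = 0.1
frame = (1.0, 0.0)
errors = []
for _ in range(500):
    upcoming = next_frame(frame)
    guess = predict(W, frame)
    err = (upcoming[0] - guess[0], upcoming[1] - guess[1])   # prediction error signal
    errors.append(err[0] ** 2 + err[1] ** 2)
    # Delta rule: nudge each weight in proportion to the error and its input.
    for r in range(2):
        for c in range(2):
            W[r][c] += lr * err[r] * frame[c]
    frame = upcoming

# Read the rotation angle back out of the learned weights.
learned_angle = math.degrees(math.atan2(W[1][0], W[0][0]))
print(errors[0], errors[-1], learned_angle)
```

Although the model is trained only to reduce prediction error on the proximal signal, its weights end up encoding the rotation that generated the sequence. The property is recovered here by reading the weights; nothing in the training signal specified it.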
Figure 3
Unsupervised video prediction can discover physical scene properties.
A recurrent network of the PredNet architecture [32] trained to predict the next frame in a simple simulated world of rotating checkered cubes. Deeper layers attempt to predict activation in preceding layers (green feedback arrows), while lower layers send up prediction errors (red feedforward arrows) and each layer propagates its current state to the next time point using LSTM units (purple recurrent arrows). Top right: Visualised activations of individual units in response to three frames of a video (brighter pixel values indicate stronger activation to a location in the frame). The unit visualised in the first row responds almost exclusively to the shadow cast by the object, but not to other shadows in the environment or to dark regions on the object. The unit visualised in the second row responds almost exclusively to moving reflectance edges on the object, but not to moving shadow edges or to still edges.
Another impressive variant on the theme of prediction is Generative Query Nets [35]. During training, the network is queried to render an image of a simulated 3D environment not simply at the next timepoint, but as it would appear from a different viewpoint. This again encouraged high-level latent scene representations, from which it was possible to decode object identities, positions, shapes and colours without any explicit labelled training data.

‘Curiosity-based’ learning is an example of the exciting possibilities that emerge when visual learning is embodied in an agent that does not merely observe passively, but can also act on the world it observes. Motivated by animal learning, the networks actively seek out the most informative parts of their environments during learning [36]. The network outputs both an action (a movement of itself or another object in the scene), and a pixelwise prediction of what its sensory input should be after performing that action. The ‘curiosity’ objective is implemented by training the network to select actions for which it has minimum confidence in its visual prediction; that is, it selects actions for which it does not yet know the consequences.

We can expect significant advances in unsupervised and weakly-supervised learning in the coming years as they receive increasing attention in machine learning research [37, 38, 39, 40]. Both biological and artificial visual systems may profit by employing hybrid strategies, in which unsupervised learning creates a robust representation of the structures in our visual worlds, and sparse supervision or reward signals tweak these for the performance of specific tasks [41,42].
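The curiosity objective described earlier in this section, choosing the action whose consequences are least well predicted, can be sketched with a toy agent. This is a strong simplification of the networks in [36]: the three-action world, the scalar ‘sensory’ outcomes and the use of the remaining prediction error as a stand-in for low confidence are all assumptions made for illustration.

```python
# Hypothetical toy world: each action deterministically yields one sensory value.
WORLD = {"push": 0.9, "pull": -0.4, "wait": 0.0}

predictions = {a: 0.0 for a in WORLD}   # agent's predicted outcome for each action
uncertainty = {a: 1.0 for a in WORLD}   # start maximally unsure about every action
lr = 0.5
history = []

for _ in range(30):
    # Curiosity: choose the action whose consequence is currently least well known.
    action = max(uncertainty, key=uncertainty.get)
    outcome = WORLD[action]
    error = outcome - predictions[action]
    predictions[action] += lr * error                           # learn from the prediction error
    uncertainty[action] = abs(outcome - predictions[action])    # remaining surprise
    history.append(action)

print(history[:6])
print({a: round(p, 3) for a, p in predictions.items()})
```

The agent tries every action at least once, stops revisiting the one it can already predict, and keeps returning to those that still surprise it. In this sketch, exploration is driven purely by the agent's own prediction errors, with no external reward or label.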
Conclusions
It is tempting to formulate vision as the estimation of physical quantities, such as size, distance or reflectance. But to understand complex appearance, we need to let go of ‘inverse optics’. The brain does not estimate predefined physical properties. Instead it represents the ‘typical appearance’ of surfaces and objects in the proximal image. That is, it identifies and represents the statistical ways that images differ from one another. When presented with a bobbly woollen sweater, what would it mean to estimate its ‘bobbliness’? And how would we learn to estimate it with no way of ever knowing the ground truth? This is the key insight. In learning to describe the proximal stimulus efficiently and accurately, the visual system discovers latent variables that are responsible for the image structure: everything from the physics of specular reflectance, to the fact that shirts have buttons evenly spaced in a vertical line. The physical properties of materials are just one example of latent factors that give images their structure. By identifying parameters of variation in the proximal stimulus, statistical appearance models provide a route to inferring the outside world without labelled training data.

Unsupervised deep learning offers a framework for implementing this idea. Here we have suggested that learning objectives such as prediction, compression and curiosity give rise to rich internal representations upon which many estimation tasks can be performed. Finding the right unsupervised learning objectives may be the key to explaining both the successes and failures of human material perception, and vision more broadly.
Conflict of interest statement
Nothing declared.