Real-time rendering of closed-loop visual environments is important for next-generation understanding of brain function and behaviour, but is often prohibitively difficult for non-experts to implement and is limited to a few laboratories worldwide. We developed BonVision as easy-to-use open-source software for the display of virtual or augmented reality, as well as standard visual stimuli. BonVision has been tested on humans and mice, and is capable of supporting new experimental designs in other animal models of vision. As the architecture is based on the open-source Bonsai graphical programming language, BonVision benefits from native integration with experimental hardware. BonVision therefore enables easy implementation of closed-loop experiments, including real-time interaction with deep neural networks, and communication with behavioural and physiological measurement and manipulation devices.
Understanding behaviour and its underlying neural mechanisms calls for the ability to construct and control complex, naturalistic environments that immerse animals, including humans, and respond to their actions. Gaming-driven advances in computation and graphical rendering have enabled the development of immersive closed-loop visual environments, but these new platforms are not readily amenable to traditional research paradigms. For example, they do not specify an image in egocentric units (of visual angle), they sacrifice precise control of the visual display, and they lack transparent interaction with external hardware.

Most vision research has been performed in non-immersive environments with standard two-dimensional visual stimuli, such as gratings or dot stimuli, using established platforms including Psychtoolbox (Brainard, 1997) or PsychoPy (Peirce, 2007; Peirce, 2008). Pioneering efforts to bring gaming-driven advances to neuroscience research have provided new platforms for closed-loop visual stimulus generation: STYTRA (Štih et al., 2019) provides 2D visual stimuli for larval zebrafish in Python, ratCAVE (Del Grosso and Sirota, 2019) is a specialised augmented reality system for rodents in Python, FreemoVR (Stowers et al., 2017) provides virtual reality in Ubuntu/Linux, and ViRMEn (Aronov and Tank, 2014) provides virtual reality in Matlab. However, these new platforms lack the generalised frameworks needed to specify or present standard visual stimuli.

Our initial motivation was to create visual display software with three key features. First, an integrated, standardised platform that could rapidly switch between traditional visual stimuli (such as grating patterns) and immersive virtual reality. Second, the ability to replicate experimental workflows across different physical configurations (e.g. when moving from one to two computer monitors, or from flat-screen to spherical projection).
Third, the ability to interface rapidly and efficiently with external hardware (needed for experimentation) without having to develop complex multi-threaded routines. We wanted to provide these advances in a way that made it easier for users to construct and run closed-loop experimental designs. In closed-loop experiments, stimuli are ideally conditioned on asynchronous inputs, such as those provided by multiple independent behavioural and neurophysiological measurement devices. Most existing platforms require the development of multi-threaded routines to run experimental paradigms (e.g. to control brain stimulation, or sample from recording devices) without compromising the rendering of visual scenes. Implementing such multi-threaded routines is complex. We therefore chose to develop a visual presentation framework within the Bonsai programming language (Lopes et al., 2015). Bonsai is a graphical, high-performance, event-based language that is widely used in neuroscience experiments and is already capable of real-time interfacing with most types of external hardware. Bonsai is specifically designed for flexible and high-performance composition of data streams and external events, and is therefore able to monitor and connect multiple sensor and effector systems in parallel, making it easier to implement closed-loop experimental designs.

We developed BonVision, an open-source software package that can generate and display well-defined visual stimuli in 2D and 3D environments. BonVision exploits Bonsai’s ability to run OpenGL commands on the graphics card through the Bonsai.Shaders package. BonVision further extends Bonsai by providing pre-built GPU shaders and resources for stimuli used in vision research, including movies, along with an accessible, modular interface for composing stimuli and designing experiments.
The definition of stimuli in BonVision is independent of the display hardware, allowing for easy replication of workflows across different experimental configurations. Additional unique features include the ability to automatically detect and define the relationship between the observer and the display from a photograph of the experimental apparatus, and to use the outputs of real-time inference methods to determine the position and pose of an observer online, thereby generating augmented reality environments.
Results
To provide a framework that allowed both traditional visual presentation and immersive virtual reality, we needed to bring these very different ways of defining the visual scene into the same architecture. We achieved this by mapping the 2D retino-centric coordinate frame (i.e. degrees of the visual field) to the surface of a 3D sphere using the Mercator projection (Figure 1A, Figure 1—figure supplement 1). The resulting sphere could therefore be rendered onto displays in the same way as any other 3D environment. We then used ‘cube mapping’ to specify the 360° projection of 3D environments onto arbitrary viewpoints around an experimental observer (human or animal; Figure 1B). Using this process, a display device becomes a window into the virtual environment, where each pixel on the display specifies a vector from the observer through that window. The vector links pixels on the display to pixels in the ‘cube map’, thereby rendering the corresponding portion of the visual field onto the display.
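This coordinate mapping can be illustrated with a short sketch (Python here for illustration only; BonVision itself implements the mapping in GPU shaders, and these function names are hypothetical). The geometry follows the convention described above, with azimuth as longitude and elevation as latitude:

```python
import math

def visual_field_to_mercator(azimuth_deg, elevation_deg):
    """Map a retino-centric coordinate (degrees of visual field) to
    Mercator texture coordinates, treating azimuth as longitude and
    elevation as latitude."""
    lon = math.radians(azimuth_deg)
    lat = math.radians(elevation_deg)
    u = lon                                        # horizontal: linear in azimuth
    v = math.log(math.tan(math.pi / 4 + lat / 2))  # vertical: Mercator stretch
    return u, v

def visual_field_to_direction(azimuth_deg, elevation_deg):
    """Map the same coordinate to a unit vector from the observer,
    i.e. a point on the 3D sphere that is subsequently cube-mapped."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = math.cos(el) * math.sin(az)
    y = math.sin(el)
    z = math.cos(el) * math.cos(az)  # z axis points straight ahead of the observer
    return x, y, z
```

In this sketch, the display pipeline would look up the texture coordinate `(u, v)` for each direction vector passing through a display pixel, which is how a flat 2D stimulus definition ends up wrapped onto the sphere.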
Figure 1.
BonVision's adaptable display and render configurations.
(A) Illustration of how two-dimensional textures are generated in BonVision using Mercator projection for sphere mapping, with elevation as latitude and azimuth as longitude. The red dot indicates the position of the observer. (B) Three-dimensional objects were placed at the appropriate positions and the visual environment was rendered using cube-mapping. (C–E) Examples of the same two stimuli, a checkerboard + grating (middle row) or four three-dimensional objects (bottom row), displayed in different experimental configurations (top row): two angled LCD monitors (C), a head-mounted display (D), and demi-spherical dome (E).
Figure 1—figure supplement 1.
Mapping stimuli onto displays in various positions.
(A) Checkerboard stimulus being rendered. (B) Projection of the stimulus onto a sphere using Mercator projection. (C) Example display positions (dA–dF) and (D) corresponding rendered images. Red dot in C indicates the observer position.
Our approach has the advantage that the visual stimulus is defined irrespective of the display hardware, allowing us to define each experimental apparatus independently, without changing the preceding specification of the visual scene or the experimental design (Figure 1C–E, Figure 1—figure supplements 1 and 2). Consequently, BonVision makes it easy to replicate visual environments and experimental designs on various display devices, including multiple monitors, curved projection surfaces, and head-mounted displays (Figure 1C–E).
To facilitate easy and rapid porting between different experimental apparatus, BonVision features fast, semi-automated display calibration. A photograph of the experimental setup containing fiducial markers (Garrido-Jurado et al., 2014) is used to measure the 3D position and orientation of each display relative to the observer (Figure 2 and Figure 2—figure supplement 1). BonVision’s inbuilt image-processing algorithms then estimate the position and orientation of each marker to fully specify the display environment.
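The core geometric step of this calibration can be sketched as follows. Assuming each marker's pose in the camera frame has already been recovered (e.g. with an ArUco detector), the display's pose relative to the observer follows from inverting one rigid transform and composing it with the other. The function names are illustrative, not BonVision's API; poses are given as a 3×3 rotation matrix plus a translation vector:

```python
def invert_pose(R, t):
    """Invert a rigid transform (R, t): the inverse is (R^T, -R^T t)."""
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]  # transpose
    ti = [-sum(Rt[i][j] * t[j] for j in range(3)) for i in range(3)]
    return Rt, ti

def compose(Ra, ta, Rb, tb):
    """Compose two rigid transforms: apply (Rb, tb) first, then (Ra, ta)."""
    R = [[sum(Ra[i][k] * Rb[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = [sum(Ra[i][k] * tb[k] for k in range(3)) + ta[i] for i in range(3)]
    return R, t

def display_pose_in_observer_frame(R_disp, t_disp, R_obs, t_obs):
    """Both marker poses are expressed in the camera frame (as recovered
    from the photograph); return the display pose relative to the observer."""
    R_inv, t_inv = invert_pose(R_obs, t_obs)
    return compose(R_inv, t_inv, R_disp, t_disp)
```

For example, with the camera at the origin, an observer marker 0.5 m away and a display marker 0.2 m away along the same axis, the display ends up 0.3 m in front of the observer in the observer's own frame.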
Figure 1—figure supplement 2.
Modular structure of workflow and example workflows.
(A) Description of the modules in BonVision workflows that generate stimuli. Every BonVision stimulus includes a module that creates and initialises the render window, shown in ‘BonVision window and resources’. This defines the window parameters in Create Window (such as background colour, screen index, VSync), and loads predefined (BonVision Resources) and user-defined textures (Texture Resources, not shown) and 3D meshes (Mesh Resources). This is followed by the modules: ‘Drawing region’, which defines the visual space covered by the stimulus (up to the complete visual space, 360° × 360°); ‘Draw stimuli’ and ‘Define scene’, where the stimulus is defined; ‘Map Stimuli’, which maps the stimuli into the 3D environment; and ‘Define display’, where the display devices are defined. (B and C) Modules that define the checkerboard + grating stimulus (B) shown in the middle row of Figure 1 and the 3D world (C) with five objects shown in the bottom row of Figure 1. The display device is defined separately and either display can be appended at the end of the workflow. This separation of the display device allows for replication between experimental configurations. (D) The variants of the modules used to display stimuli on a head-mounted display. The empty region under ‘Define scene’ would be filled by the corresponding nodes in B and C.
Figure 2.
Automated calibration of display position.
(A) Schematic showing the position of two hypothetical displays of different sizes, at different distances and orientations relative to the observer (red dot). (B) How a checkerboard of the same visual angle would appear on each of the two displays. (C) Example of automatic calibration of display position. Standard markers are presented on the display, or in the environment, to allow automated detection of the position and orientation of both the display and the observer. These positions and orientations are indicated by the superimposed red cubes as calculated by BonVision. (D) How the checkerboard would appear on the display when rendered, taking into account the precise position of the display. (E and F) Same as (C and D), but for another pair of display and observer positions. The automated calibration was based on the images shown in C and E.
Figure 2—figure supplement 1.
Automated workflow to calibrate display position.
The automated calibration takes advantage of ArUco markers (Garrido-Jurado et al., 2014), which can be used to calculate the 3D position of a surface. (Ai) We use one marker on the display and one placed at the position of the observer, and a picture of the display and observer position taken by a calibrated camera. This is an example where we used a mobile phone camera for calibration. (Aii) The detected 3D positions of the screen and the observer, as calculated by BonVision. (Aiii) A checkerboard image and a small superimposed patch of grating, rendered based on the precise position of the display. (B and C) Same as (A), but for different screen and observer positions: with the screen tilted towards the animal (B), or the observer shifted to the right of the screen (C). The automated calibration was based on the images shown in Ai, Bi, and Ci, which in this case were taken using a mobile phone camera.
Automated gamma-calibration of visual displays.
BonVision monitored a photodiode (Photodiode v2.1, https://www.cf-hw.org/harp/behavior) through a HARP microprocessor to measure the light output of the monitor (Dell Latitude 7480). The red, green, and blue channels of the display were sent the same values (i.e. grey scale). (A) Gamma calibration. The input to the display channels was modulated by a linear ramp (range 0–255). Without calibration the monitor output (arbitrary units) increased exponentially (blue line). The measurement was then used to construct an intermediate look-up table that corrected the values sent to the display. Following calibration, the display intensity is close to linear (red line). Inset at top: schematic of the experimental configuration. (B) Similar to A, but showing the intensity profile of a drifting sinusoidal grating. Measurements before calibration resemble an exponentiated sinusoid (blue dotted line). Measurements after calibration resemble a regular sinusoid (red dotted line).

Virtual reality environments are easy to generate in BonVision. BonVision has a library of standard pre-defined 3D structures (including planes, spheres, and cubes), and environments can be defined by specifying the position and scale of these structures and the textures rendered on them (e.g. Figure 1—figure supplement 2 and Figure 5F). BonVision can also import standard-format 3D design files created elsewhere to generate more complex environments (file formats are listed in Materials and methods). This allows users to leverage existing 3D drawing platforms (including the open-source platform Blender: https://www.blender.org/) to construct complex virtual scenes (see Appendix 1).

BonVision can define the relationship between the display and the observer in real time. This makes it easy to generate augmented reality environments, where what is rendered on a display depends on the position of an observer (Figure 3A).
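The look-up-table correction described in the gamma-calibration procedure can be sketched as follows (a minimal illustration, not BonVision's implementation): given the luminance measured at each of the 256 input levels, build an inverse table that picks, for each desired linear luminance step, the input level whose measured output comes closest.

```python
def build_gamma_lut(measured, levels=256):
    """Build an inverse look-up table from the luminance measured at each
    input level (0..levels-1), so that commanded intensity maps
    approximately linearly onto emitted luminance."""
    lo, hi = min(measured), max(measured)
    lut = []
    for i in range(levels):
        target = lo + (hi - lo) * i / (levels - 1)  # desired linear luminance
        # choose the input level whose measured output is closest to target
        best = min(range(levels), key=lambda j: abs(measured[j] - target))
        lut.append(best)
    return lut
```

Replaying the linear ramp through `lut` should then produce a near-linear photodiode trace, as in the red line of the figure.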
For example, when a mouse navigates through an arena surrounded by displays, BonVision enables closed-loop, position-dependent updating of those displays. Bonsai can track markers to determine the position of the observer, and it also has turn-key capacity for real-time pose estimation using deep neural networks (Mathis et al., 2018; Pereira et al., 2019; Kane et al., 2020) to keep track of the observer’s movements. This allows users to generate and present interactive visual environments (simulation in Figure 3—video 1 and Figure 3B and C).
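The 'display as a window' rendering that such position-dependent updating relies on amounts to an off-axis perspective projection. A minimal sketch (a hypothetical helper, not BonVision's API) computes the frustum extents at the near clipping plane for an observer at an arbitrary position in front of a display modelled as a rectangle in the z = 0 plane:

```python
def off_axis_frustum(eye, screen_left, screen_right, screen_bottom, screen_top, near):
    """Return (left, right, bottom, top) extents at the near plane for an
    off-axis perspective projection, treating the display as a window
    at z = 0 viewed by an observer at `eye` (x, y, z with z > 0)."""
    ex, ey, ez = eye
    scale = near / ez  # similar triangles: project screen edges onto the near plane
    left = (screen_left - ex) * scale
    right = (screen_right - ex) * scale
    bottom = (screen_bottom - ey) * scale
    top = (screen_top - ey) * scale
    return left, right, bottom, top
```

As the tracked observer moves, the frustum becomes asymmetric, so the rendered image shifts exactly as the view through a real window would; the four extents map directly onto the parameters of a standard OpenGL off-axis projection matrix.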
Figure 3.
Using BonVision to generate an augmented reality environment.
(A) Illustration of how the image on a fixed display needs to adapt as an observer (red dot) moves around an environment. The displays simulate windows from a box into a virtual world outside. (B) The virtual scene (from: http://scmapdb.com/wad:skybox-skies) that was used to generate the example images and Figure 3—video 1 offline. (C) Real-time simulation of scene rendering in augmented reality. We show two snapshots of the simulated scene rendering, which is also shown in Figure 3—video 1. In each case the inset image shows the actual video images, of a mouse exploring an arena, that were used to determine the viewpoint of an observer in the simulation. The mouse’s head position was inferred (at a rate of 40 frames/s) by a network trained using DeepLabCut (Mathis et al., 2018). The top image shows an instance when the animal was on the left of the arena (head position indicated by the red dot in the main panel) and the lower image shows an instance when it was on the right of the arena.
Figure 3—video 1.
Augmented reality simulation using BonVision.
This video is an example of a deep neural network, trained with DeepLabCut, being used to estimate the position of a mouse’s head in an environment in real-time, and updating a virtual scene presented on the monitors based on this estimated position. The first few seconds of the video display the online tracking of specific features (nose, head, and base of tail) while an animal is moving around (shown as a red dot) in a three-port box (as in Soares et al., 2016). Subsequently the inset shows the original video of the animal’s movements, which the simulation is based on. The rest of the video image shows how a green field landscape (source: http://scmapdb.com/wad:skybox-skies) outside the box would be rendered on three simulated displays within the box (one placed on each of the three oblique walls). These three displays simulate windows onto the world beyond the box. The position of the animal was updated by DeepLabCut at 40 frames/s, and the simulation was rendered at the same rate.
BonVision is capable of rendering visual environments near the limits of the hardware (Figure 4). This is possible because Bonsai is based on a just-in-time compiler architecture, which adds little computational overhead. BonVision accumulates a list of OpenGL commands as the programme issues them. To optimise rendering performance, these commands are prioritised in the order defined in the Shaders component of the LoadResources node (which the user can manipulate for high-performance environments), and the ordered calls are then executed when the frame is rendered. To benchmark the responsiveness of BonVision in closed-loop experiments, we measured the delay (latency) between an external event and the presentation of a visual stimulus. We first measured the closed-loop latency for BonVision when a monitor was refreshed at a rate of 60 Hz (Figure 4A). We found that delays averaged 2.11 ± 0.78 frames (35.26 ± 13.07 ms).
This latency was slightly shorter than that achieved by Psychtoolbox (Brainard, 1997) on the same laptop (2.44 ± 0.59 frames, 40.73 ± 9.8 ms; Welch’s t-test, p < 10⁻⁸⁰, n = 1000). The overall latency of BonVision was mainly constrained by the refresh rate of the display device, such that higher frame-rate displays yielded lower latencies (60 Hz: 35.26 ± 13.07 ms; 90 Hz: 28.45 ± 7.22 ms; 144 Hz: 18.49 ± 10.1 ms; Figure 4A). That is, the number of frames between the external event and stimulus presentation was similar across frame rates (60 Hz: 2.11 ± 0.78 frames; 90 Hz: 2.56 ± 0.65 frames; 144 Hz: 2.66 ± 1.45 frames; Figure 4C). We used two additional methods to benchmark visual display performance relative to other frameworks (we did not try to optimise code fragments for each framework) (Figure 4B and C). BonVision was able to render up to 576 independent elements and up to eight overlapping textures at 60 Hz without missing (‘dropping’) frames, broadly matching PsychoPy (Peirce, 2007; Peirce, 2008) and Psychtoolbox (Brainard, 1997). BonVision’s performance was similar at the standard frame rate (60 Hz) and at 144 Hz (Figure 4—figure supplement 1). BonVision achieved slightly fewer overlapping textures than PsychoPy, as BonVision does not currently have the option to trade off the resolution of a texture and its mask for performance. BonVision also supports video playback, either by preloading the video or by streaming it from disk. The streaming mode, which uses real-time file I/O and decompression, is capable of displaying both standard definition (480p) and full HD (1080p) video at 60 Hz on a standard computer (Figure 4D). At higher rates, performance is impaired for full HD videos, but it is improved by buffering and fully restored by preloading the video into memory (Figure 4D). We benchmarked BonVision on a standard Windows OS laptop, but BonVision is now also capable of running on Linux.
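Dropped frames of the kind counted in these benchmarks can be estimated from successive frame-flip timestamps (e.g. derived from the photodiode signal): any inter-frame interval spanning more than one refresh period indicates missed refreshes. A minimal sketch, with an illustrative function name:

```python
def count_dropped_frames(timestamps, refresh_hz=60.0):
    """Estimate dropped frames from successive frame-flip timestamps
    (in seconds): each interval is rounded to a whole number of refresh
    periods, and any periods beyond one count as dropped frames."""
    period = 1.0 / refresh_hz
    dropped = 0
    for t0, t1 in zip(timestamps, timestamps[1:]):
        extra = round((t1 - t0) / period) - 1  # whole periods beyond one
        if extra > 0:
            dropped += extra
    return dropped
```

Rounding to whole periods makes the count robust to the small timing jitter expected from a real photodiode measurement.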
Figure 4.
Closed-loop latency and performance benchmarks.
(A) Latency between sending a command (virtual key press) and updating the display (measured using a photodiode). (A.i–A.iv) Latency depended on the frame rate of the display, with stimuli updating after a delay of one to three frames. (B and C) Benchmarked performance of BonVision with respect to Psychtoolbox and PsychoPy. (B) When using non-overlapping textures, BonVision and Psychtoolbox could present 576 independent textures without dropping frames, while PsychoPy could present 16. (C) When using overlapping textures, PsychoPy could present 16 textures, while BonVision and Psychtoolbox could present eight textures without dropping frames. (D) Benchmarks for movie playback. BonVision is capable of displaying standard definition (480p) and high definition (1080p) movies at 60 frames/s on a laptop computer with a standard CPU and graphics card. We measured display rate when fully pre-loading the movie into memory (blue), or when streaming from disk (with no buffer: orange; 1-frame buffer: green; 2-frame buffer: red; 4-frame buffer: purple). When asked to display at rates higher than the monitor refresh rate (>60 frames/s), the 480p video played at the maximum frame rate of 60 frames/s in all conditions, while the 1080p video reached the maximum rate only when pre-loaded. Using a buffer slightly improved performance. The black square at the bottom right of the screen in A–C indicates the position of a flickering rectangle, which switches between black and white on every screen refresh; the luminance in this square is detected by a photodiode and used to measure the actual frame flip times.
Figure 4—figure supplement 1.
BonVision performance benchmarks at high frame rate.
(A) When using non-overlapping textures, BonVision was able to render 576 independent textures without dropping frames at 60 Hz. At 144 Hz, BonVision was able to render 256 non-overlapping textures with no dropped frames, and seldom dropped frames with 576 textures. BonVision was unable to render 1024 or more textures at the requested frame rate. (B) When using overlapping textures, BonVision was able to render 64 independent textures without dropping frames at 60 Hz. At 144 Hz, BonVision was able to render 32 textures with no dropped frames. Note that these tests were performed on a computer with a better hardware specification than that used in Figure 4, which led to improved performance on the benchmarks at 60 Hz. The black square at the bottom right of the screen in A and B is the position of a flickering rectangle, which switches between black and white at every screen refresh. The luminance in this square is detected by a photodiode and used to measure the actual frame flip times.
To confirm that the rendering speed and timing accuracy of BonVision are sufficient to support neurophysiological experiments, which need high timing precision, we mapped the receptive fields of neurons early in the visual pathway (Yeh et al., 2009), in the mouse primary visual cortex and superior colliculus. The stimulus (‘sparse noise’) consisted of small black or white squares briefly (0.1 s) presented at random locations (Figure 5A). This stimulus, which is commonly used to measure the receptive fields of visual neurons, is sensitive to the timing accuracy of the visual display, meaning that errors in timing would prevent the identification of receptive fields. In our experiments using BonVision, we were able to recover receptive fields from electrophysiological measurements, both in the superior colliculus and the primary visual cortex of awake mice (Figure 5B and C), demonstrating that BonVision meets the timing requirements for visual neurophysiology.
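Receptive field mapping with sparse noise amounts to reverse correlation: averaging the stimulus frames that preceded each spike. The sketch below is a minimal illustration of that analysis (not BonVision or the study's MATLAB code), assuming frames coded as +1 (white), −1 (black), and 0 (grey), and a single fixed response latency:

```python
import numpy as np

def sparse_noise_rf(stim_frames, stim_times, spike_times, delay=0.05):
    """Estimate a receptive field by spike-triggered averaging.

    stim_frames: (n_stim, h, w) array of sparse-noise frames;
    stim_times: onset time of each frame in seconds (sorted);
    spike_times: spike timestamps in seconds;
    delay: assumed response latency between stimulus and spike.
    """
    # For each spike, find the frame that was on screen `delay` earlier.
    idx = np.searchsorted(stim_times, np.asarray(spike_times) - delay) - 1
    idx = idx[(idx >= 0) & (idx < len(stim_frames))]
    if len(idx) == 0:
        return np.zeros(stim_frames.shape[1:])
    # The average of spike-triggering frames highlights the RF location.
    return stim_frames[idx].mean(axis=0)
```

Splitting the frames by polarity before averaging yields the separate ON and OFF maps shown in Figure 5.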
The receptive fields shown in Figure 5C were generated using timing signals obtained directly from the stimulus display (via a photodiode). BonVision’s independent logging of stimulus presentation timing was also sufficient to capture the receptive field (Figure 5—figure supplement 1).
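Aligning BonVision's timing log with the recording system's clock can be done by fitting a line through shared synchronisation events, which absorbs both a fixed offset and any slow drift between the two clocks. This is a generic sketch of such an alignment (the specific timing system used in the study is not detailed beyond the caption):

```python
import numpy as np

def align_clocks(sync_local, sync_remote):
    """Map timestamps from one clock to another using shared sync events.

    sync_local/sync_remote: timestamps of the same physical events as
    measured on the two clocks. Returns a function converting local
    times to remote times, fitting drift (slope) and offset by least
    squares.
    """
    slope, offset = np.polyfit(sync_local, sync_remote, 1)
    return lambda t: slope * np.asarray(t) + offset
```

With more than two sync events, the least-squares fit also averages out jitter in the individual event detections.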
Figure 5.
Illustration of BonVision across a range of vision research experiments.
(A) Sparse noise stimulus, generated with BonVision, rendered onto a demi-spherical screen. (B and C) Receptive field maps from recordings of local field potential in the superior colliculus (B) and spiking activity in the primary visual cortex (C) of the mouse. (D) Two cubes were presented at different depths in a virtual environment through a head-mounted display to human subjects. Subjects had to report which cube was larger: left or right. (E) Subjects predominantly reported the larger object correctly, with a slight bias to report that the object in front was bigger. (F) BonVision was used to generate a closed-loop virtual platform that a mouse could explore (top: schematic of platform). Mice naturally tended to run faster along the platform, and in later sessions developed a speed profile, slowing down as they approached the end of the platform (virtual cliff). (G) The speed of the animal at the start and at the end of the platform as a function of training. (H) BonVision was used to present visual stimuli overhead while an animal was free to explore an environment (which included a refuge). The stimulus was a small dot (5° diameter) moving across the projected surface over several seconds. (I) The cumulative probability of freeze and flight behaviour across time in response to a moving dot presented overhead.
Figure 5—figure supplement 1.
BonVision timing logs are sufficient to support receptive field mapping of spiking activity.
Top row in each case shows the receptive field identified using the timing information provided by a photodiode that monitored a small square on the stimulus display that was obscured from the animal. Bottom row in each case shows the receptive field identified using the timing logged by BonVision during stimulus presentation (a separate timing system was used to align the clocks between the computer hosting BonVision and the Open Ephys recording device). (A) Average OFF and ON receptive field maps for 33 simultaneously recorded units in a single recording session. (B) Individual OFF receptive field maps for three representative units in the same session.
To assess the ability of BonVision to control virtual reality environments, we first tested its ability to present stimuli to human observers on a head-mounted display (Scarfe and Glennerster, 2015). BonVision uses positional information (obtained from the head-mounted display) to update the view of the world that needs to be provided to each eye, and returns two appropriately rendered images. On each trial, we asked observers to identify the larger of two non-overlapping cubes that were placed at different virtual depths (Figure 5D and E). The display was updated in closed loop to allow observers to alter their viewpoint by moving their head. Distinguishing objects of the same retinal size required observers to use depth-dependent cues (Rolland et al., 1995), and we found that all observers were able to identify which cube was larger (Figure 5E).

We next asked if BonVision was capable of supporting other visual display environments that are increasingly common in the study of animal behaviour. We first projected a simple environment onto a dome that surrounded a head-fixed mouse (as shown in Figure 1E). The mouse was free to run on a treadmill, and the treadmill’s movements were used to update the mouse’s position on a virtual platform (Figure 5F).
Not only did mouse locomotion speed increase with repeated exposure, but the animals also modulated their speed depending on their location on the platform (Figure 5F and G). BonVision is therefore capable of generating virtual reality environments that both elicit and are responsive to animal behaviour. BonVision was also able to evoke instinctive avoidance behaviours in freely moving mice (Figure 5H and I). We displayed a small black dot slowly sweeping across the overhead visual field. Visual stimuli presented in BonVision primarily elicited a freezing response, consistent with previous reports from similar experiments (De Franceschi et al., 2016; Figure 5I). Together these results show that BonVision provides sufficient rendering performance to support studies of human and animal visual behaviour.
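The closed-loop platform described above reduces to converting each treadmill encoder reading into distance travelled and advancing the animal's virtual position. The sketch below illustrates that update step with placeholder hardware parameters (encoder resolution, wheel radius), not the values used in this study:

```python
import math

def update_position(position, encoder_ticks, ticks_per_rev=1024,
                    wheel_radius_cm=10.0, platform_length_cm=100.0):
    """Advance the animal's position on a virtual platform from treadmill
    encoder ticks. Encoder resolution and wheel radius here are
    illustrative placeholders.
    """
    # Wheel circumference fraction travelled since the last update.
    distance = 2 * math.pi * wheel_radius_cm * encoder_ticks / ticks_per_rev
    new_position = position + distance
    reached_end = new_position >= platform_length_cm
    # Clamp to the end of the platform; a real trial would then reset.
    return min(new_position, platform_length_cm), reached_end
```

In the experiment, a function of this kind would run once per frame, with the returned position driving the camera along the virtual platform and the end-of-platform flag ending the trial.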
Discussion
BonVision is a single software package to support experimental designs that require visual display, including virtual and augmented reality environments. BonVision is easy and fast to implement, cross-platform, and open source, providing versatility and reproducibility.

BonVision makes it easier to address several barriers to reproducibility in visual experiments. First, BonVision is able to replicate and deliver visual stimuli on very different experimental apparatus. This is possible because BonVision’s architecture separates the specification of the display from that of the visual environment. Second, BonVision includes a library of workflows and operators to standardise and ease the construction of new stimuli and virtual environments. For example, it has established protocols for defining display positions (Figure 3), mesh-mapping of curved displays (Figure 1E), and automatic linearisation of display luminance (Figure 4), as well as a library of examples for experiments commonly used in visual neuroscience. In addition, the modular structure of BonVision enables the development and exchange of custom nodes for generating new visual stimuli or functionality without the need to construct the complete experimental paradigm. Third, BonVision is based on Bonsai (Lopes et al., 2015), which has a large user base and an active developer community, and is now a standard tool for open-source neuroscience research.
BonVision naturally integrates Bonsai’s established packages in the multiple domains important for modern neuroscience, which are widely used in applications including real-time video processing (Zacarias et al., 2018; Buccino et al., 2018), optogenetics (Zacarias et al., 2018; Buccino et al., 2018; Moreira et al., 2019), fibre photometry (Soares et al., 2016; Hrvatin et al., 2020), electrophysiology (including specific packages for Open Ephys (Siegle et al., 2017; Neto et al., 2016) and high-density silicon probes (Jun et al., 2017; Dimitriadis, 2018)), and calcium imaging (e.g. the UCLA Miniscope; Aharoni et al., 2019; Cai et al., 2016). Bonsai requires researchers to get accustomed to its graphical interface and event-based framework. However, it subsequently reduces the time required to learn real-time programming, and the time needed to build new interfaces with external devices (see Appendix 1). Moreover, since Bonsai workflows can be called via the command line, BonVision can also be integrated into pre-existing, specialised frameworks in established laboratories.

In summary, BonVision can generate complex 3D environments and retinotopically defined 2D visual stimuli within the same framework. Existing platforms used for vision research, including PsychToolbox (Brainard, 1997), PsychoPy (Peirce, 2007; Peirce, 2008), STYTRA (Štih et al., 2019), and RigBox (Bhagat et al., 2020), focus on well-defined 2D stimuli. Similarly, gaming-driven software, including FreemoVR (Stowers et al., 2017), ratCAVE (Del Grosso and Sirota, 2019), and ViRMEn (Aronov and Tank, 2014), is oriented towards generating virtual reality environments. BonVision combines the advantages of both these approaches in a single framework (Appendix 1), while bringing the unique capacity to automatically calibrate the display environment and to use deep neural networks for real-time control of virtual environments.
Experiments in BonVision can be rapidly prototyped and easily replicated across different display configurations. Being free, open-source, and portable, BonVision is a state-of-the-art tool for visual display that is accessible to the wider community.
Materials and methods
Benchmarking
We performed benchmarking to measure latencies and skipped (‘dropped’) frames. For benchmarks at 60 Hz refresh rate, we used a standard laptop with the following configuration: Dell Latitude 7480, Intel Core i7-6600U processor with integrated HD Graphics 520 (dual core, 2.6 GHz), 16 GB RAM. For higher refresh rates we used a gaming laptop, an ASUS ROG Zephyrus GX501GI with an Intel Core i7-8750H (six cores, 2.20 GHz), 16 GB RAM, equipped with an NVIDIA GeForce GTX 1080. The gaming laptop's built-in display refreshes at 144 Hz, and for measuring latencies at 90 Hz we connected it to a Vive Pro SteamVR head-mounted display (90 Hz refresh rate). All tests were run on Windows 10 Pro 64-bit.

To measure the time from input detection to display update, as well as to detect dropped frames, we used open-source HARP devices from the Champalimaud Research Scientific Hardware Platform, via the Bonsai.HARP package. Specifically, we used the HARP Behavior device (a low latency DAQ; https://www.cf-hw.org/harp/behavior) to synchronise all measurements, with the extensions ‘Photodiode v2.1’ to measure the change of the stimulus on the screen, and ‘Mice poke simple v1.2’ as the nose-poke device to externally trigger changes. To filter out the infrared noise generated by an internal LED sensor inside the Vive Pro HMD, we positioned an infrared cut-off filter between the internal headset optics and the photodiode. Typically, the minimal latency for any update is two frames: one needed for the VSync, and one introduced by the OS. Display hardware can add further delays if it includes additional buffering. Benchmarks for video playback were carried out using a trailer from the Durian Open Movie Project (copyright Blender Foundation | durian.blender.org). All benchmark programmes and data are available at https://github.com/bonvision/benchmarks.
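Because the benchmark square alternates between black and white on every refresh, dropped frames appear as photodiode transition intervals longer than one frame period. A sketch of the counting logic (our own illustration, not the HARP firmware):

```python
import numpy as np

def count_dropped_frames(transition_times, refresh_rate_hz, ):
    """Count dropped frames from photodiode transition times.

    The benchmark square alternates black/white every refresh, so
    consecutive photodiode transitions should be one frame period
    apart; longer gaps indicate dropped frames.
    """
    period = 1.0 / refresh_rate_hz
    intervals = np.diff(np.asarray(transition_times))
    # Number of whole extra periods hidden in each interval.
    extra = np.round(intervals / period) - 1
    return int(np.sum(extra[extra > 0]))
```

Rounding to whole periods makes the count robust to the small timing jitter expected in real photodiode traces.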
File formats
We tested the display of images and videos using the image and video benchmark workflows. We confirmed the ability to use the following image formats: PNG, JPG, BMP, TIFF, and GIF. Movie display relies on the FFmpeg library (https://ffmpeg.org/), an industry standard, and we confirmed the ability to use the following containers: AVI, MP4, OGG, OGV, and WMV, in conjunction with the standard codecs H264, MPEG4, MPEG2, and DIVX. Importing 3D models and complex scenes relies on the Open Asset Import Library (Assimp | http://assimp.org/). We confirmed the ability to import and render 3D models and scenes from the following formats: OBJ, Blender.
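The buffered streaming mode benchmarked in Figure 4D can be illustrated with a generic producer/consumer sketch, in which a background thread decodes frames into a small queue while the render loop consumes them. This is an illustration of the buffering strategy only, not BonVision's implementation:

```python
import queue
import threading

def play_with_buffer(read_frame, n_frames, buffer_size=4):
    """Decode frames on a background thread into a bounded buffer while
    the display loop consumes them.

    read_frame(i) stands in for FFmpeg decoding of frame i; appending to
    `shown` stands in for rendering.
    """
    buf = queue.Queue(maxsize=buffer_size)

    def decoder():
        for i in range(n_frames):
            buf.put(read_frame(i))   # blocks when the buffer is full
        buf.put(None)                # end-of-stream sentinel

    threading.Thread(target=decoder, daemon=True).start()

    shown = []
    while True:
        frame = buf.get()            # blocks until a frame is ready
        if frame is None:
            break
        shown.append(frame)
    return shown
```

A larger buffer absorbs occasional slow decodes (the benefit seen for the 1-, 2-, and 4-frame buffers in Figure 4D), at the cost of memory; preloading is the limiting case where the "buffer" holds the entire movie.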
Animal experiments
All experiments were performed in accordance with the Animals (Scientific Procedures) Act 1986 (United Kingdom) and Home Office (United Kingdom) approved project and personal licenses. The experiments were approved by the University College London Animal Welfare Ethical Review Board under Project License 70/8637. The mice (C57BL6 wild-type) were group-housed with a maximum of five to a cage, under a 12 hr light/dark cycle. All behavioural and electrophysiological recordings were carried out during the dark phase of the cycle.
Innate defensive behaviour
Mice (five male, C57BL6, 8 weeks old) were placed in a 40 cm square arena. A dark refuge placed outside the arena could be accessed through a 10 cm door in one wall. A DLP projector (Optoma GT760) illuminated a screen 35 cm above the arena with a grey background (80 candela/m2). When the mouse was near the centre of the arena, a 2.5 cm black dot appeared on one side of the projection screen and translated smoothly to the opposite side over 3.3 s. Ten trials were conducted over 5 days, and the animal was allowed to explore the environment for 5–10 min before the onset of each trial.

Mouse movements were recorded with a near-infrared camera (Blackfly S, BFS-U3-13Y3M-C, sampling rate: 60 Hz) positioned over the arena. An infrared LED was used to align video and stimulus. Freezing was defined as a drop in the animal's speed below 2 cm/s lasting more than 0.1 s; flight responses as an increase in the animal's running speed above 40 cm/s (De Franceschi et al., 2016). Responses were only considered if they occurred within 3.5 s of stimulus onset.
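The speed criteria above can be expressed directly in code. Below is a sketch of per-trial classification using the stated thresholds; the rule that flight takes precedence over freezing within a trial is our simplifying assumption, not stated in the text:

```python
import numpy as np

def classify_response(speed, t, stim_onset, freeze_thresh=2.0,
                      flight_thresh=40.0, min_freeze_dur=0.1, window=3.5):
    """Label a trial as 'flight', 'freeze', or 'none' from a running
    speed trace (cm/s) sampled at times t (s)."""
    in_window = (t >= stim_onset) & (t <= stim_onset + window)
    # Flight: speed exceeds 40 cm/s anywhere in the response window.
    if np.any(speed[in_window] > flight_thresh):
        return "flight"
    # Freeze: speed stays below 2 cm/s for longer than 0.1 s.
    dt = np.median(np.diff(t))
    run = 0
    for below in speed[in_window] < freeze_thresh:
        run = run + 1 if below else 0
        if run * dt > min_freeze_dur:
            return "freeze"
    return "none"
```

Accumulating cumulative counts of these labels across trials, as a function of time from stimulus onset, gives curves of the kind shown in Figure 5I.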
Surgery
Mice were implanted with a custom-built stainless-steel metal plate on the skull under isoflurane anaesthesia. A ~1 mm craniotomy was performed either over the primary visual cortex (2 mm lateral and 0.5 mm anterior from lambda) or the superior colliculus (0.5 mm lateral and 0.2 mm anterior from lambda). Mice were allowed to recover for 4–24 hr before the first recording session.

We used a virtual reality apparatus similar to those used in previous studies (Schmidt-Hieber and Häusser, 2013; Muzzu et al., 2018). Briefly, mice were head-fixed above a polystyrene wheel with a radius of 10 cm. Mice were positioned in the geometric centre of a truncated spherical screen onto which we projected the visual stimulus. The visual stimulus was centred at +60° azimuth and +30° elevation and spanned 120° in azimuth and 120° in elevation.
Virtual reality behaviour
Five male, 8-week-old C57BL6 mice were used for this experiment. One week after the surgery, mice were placed on a treadmill and habituated to the virtual reality (VR) environment by progressively increasing the amount of time spent head-fixed: from ~15 min to 2 hr. Mice spontaneously ran on the treadmill, moving through the VR in the absence of reward. The VR environment was a 100 cm long platform with a patterned texture, which animals ran along over multiple trials. Each trial started with the animal at the start of the platform and ended when it reached the end, or when 60 s had elapsed. At the end of a trial, there was a 2 s grey interval before the start of the next trial.
Neural recordings
To record neural activity, we used multi-electrode array probes with two shanks and 32 channels (ASSY-37 E-1, Cambridge Neurotech Ltd., Cambridge, UK). Electrophysiology data were acquired with an Open Ephys acquisition board connected to a different computer from that used to generate the visual stimulus.

The electrophysiological data from each session were processed using Kilosort 1 or Kilosort 2 (Pachitariu et al., 2016). We synchronised spike times with behavioural data by aligning the signal of a photodiode that detected the visual stimulus transitions (PDA25K2, Thorlabs, Inc, USA). We sampled the firing rate at 60 Hz, and then smoothed it with a 300 ms Gaussian filter. We calculated receptive fields as the average firing rate or local field potential elicited by the appearance of a stimulus in each location (custom routines in MATLAB).
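The firing rate processing can be sketched as binning spikes at 60 Hz and convolving with a Gaussian kernel. In this Python sketch we treat the 300 ms as the kernel sigma, though the paper's MATLAB routines may have parameterised the filter width differently:

```python
import numpy as np

def smoothed_rate(spike_times, t_start, t_stop, fs=60.0, sigma_s=0.3):
    """Bin spikes at `fs` Hz and smooth with a normalised Gaussian kernel
    (sigma given in seconds)."""
    edges = np.arange(t_start, t_stop + 1 / fs, 1 / fs)
    counts, _ = np.histogram(spike_times, edges)
    rate = counts * fs                       # convert counts to spikes/s
    # Build a Gaussian kernel truncated at +/- 4 sigma.
    sigma_bins = sigma_s * fs
    half = int(np.ceil(4 * sigma_bins))
    k = np.exp(-0.5 * (np.arange(-half, half + 1) / sigma_bins) ** 2)
    k /= k.sum()                             # unit area preserves mean rate
    return np.convolve(rate, k, mode="same")
```

Because the kernel is normalised to unit area, a steady firing rate passes through the filter unchanged, which is a quick sanity check on the implementation.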
Augmented reality for mice
The mouse behaviour videos were acquired by Bruno Cruz from the lab of Joe Paton at the Champalimaud Centre for the Unknown, using methods similar to Soares et al., 2016. A ResNet-50 network was trained using DeepLabCut (Mathis et al., 2018; Kane et al., 2020). We simulated a visual environment in which a virtual scene was presented beyond the arena, and updated the scenes on three walls of the arena. This simulated how the view changed as the animal moved through the environment. The position of the animal was updated from the video file at a rate of 40 frames/s on a gaming laptop (ASUS ROG Zephyrus GX501GI, with an Intel Core i7-8750H (six cores, 2.20 GHz), 16 GB RAM, equipped with an NVIDIA GeForce GTX 1080), using 512 × 512 video. Performance can be improved by capturing video at a lower pixel resolution, and using this strategy we were able to achieve up to 80 frames/s without a noticeable decrease in tracking accuracy. Further enhancements can be achieved using a MobileNetV2 network (Kane et al., 2020). The position inference from the deep neural network and the BonVision visual stimulus rendering were run on the same machine.
Human psychophysics
All procedures were approved by the Experimental Psychology Ethics Committee at University College London (Ethics Application EP/2019/002). We obtained informed consent and consent to publish from all participants. Four male participants were tested in this experiment. The experiments were run on a gaming laptop (described above) connected to a Vive Pro SteamVR head-mounted display (90 Hz refresh rate). BonVision is compatible with different headsets (e.g. Oculus Rift, HTC Vive). BonVision receives the projection matrix (the perspective projection for each eye's display) and the view matrix (the position of each eye in the world) from the headset. BonVision uses these matrices to generate two textures, one for the left eye and one for the right eye. Standard onboard computations on the headset provide additional non-linear transformations that account for the relationship between the eye and the display (such as lens distortion effects).
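Conceptually, each eye's view matrix is the head's view translated by half the interpupillary distance along the head's horizontal axis. The sketch below illustrates that relationship in a simplified form (real headset runtimes supply these matrices directly, together with per-eye projection matrices and lens-distortion corrections; the 63 mm default is a typical adult value, not a parameter from the study):

```python
import numpy as np

def eye_view_matrices(head_view, ipd=0.063):
    """Derive per-eye 4x4 view matrices from a head view matrix.

    A view matrix maps world to eye coordinates, so an eye sitting at
    -ipd/2 along the head's x-axis corresponds to translating the world
    by +ipd/2, and vice versa.
    """
    def translate_x(dx):
        m = np.eye(4)
        m[0, 3] = dx
        return m

    left = translate_x(+ipd / 2) @ head_view
    right = translate_x(-ipd / 2) @ head_view
    return left, right
```

Rendering the scene once with each matrix yields the two textures, one per eye, that the headset then warps for its lenses.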
Code availability
BonVision is an open-source software package available to use under the MIT license. It can be downloaded through the Bonsai (bonsai-rx.org) package manager, and the source code is available at: github.com/bonvision/BonVision. All benchmark programmes and data are available at https://github.com/bonvision/benchmarks (copy archived at swh:1:rev:7205c04aa8fcba1075e9c9991ac117bd25e92639, Lopes, 2021). Installation instructions, demos, and learning tools are available at: bonvision.github.io/.
Most importantly, line 19 "the ability for rapid and efficient interfacing with external hardware (needed for experimentation) without development of complex multi-threaded routines" is a bit mysterious to me because I am unsure what these external hardware are that BonVision facilitates interfacing with. For example, experimenters do prefer multi-threaded routines where the other threads are used to trigger reward delivery, sensory stimuli of other modalities, or control neural stimulation or recording devices. This is in order to avoid blocking execution of the visual display software when these other functions are called. If BonVision provides a solution for these kinds of experiment interfacing requirements, I think they are definitely important enough to mention in the text. Otherwise, the sentence of line 19 needs some work in order to make it clear as to exactly which functionalities of BonVision are being referred to.The other claims that stood out to me are as follows. In the abstract it is said that "Real-time rendering… necessary for next-generation…", but I don't know if anybody can actually claim that any one method is necessary. In line 116, "suggesting habituation to the virtual environment", the authors can also acknowledge that mice might simply be habituating to the rig (e.g. even if there was no visual display), since this does not seem to be a major claim that needs to be made. The virtual cliff effect (line 118) also seems very interesting, but the authors have not fully demonstrated that mice are not alternatively responding to a change in floor texture. It is also unclear to me why a gray floor (which looks to be equiluminant with the rest of the textured floor at least by guessing from Figure 5F) should be visually identified as a cliff, as opposed to, say, black. 
In order to make this claim about visual cliff identification especially without binocular vision, the authors would probably have to show experiments where the mice do not slow down at other floor changes (to white maybe?), but I'm unsure as to whether the data exists for this nor whether it is worth the effort. Overall I don't see a reason why the authors should attempt to claim that "BonVision is capable of eliciting naturalistic behaviors in a virtual environment", since the naturalness of rodent behaviors in virtual environments is a topic of debate in some circles, independent of the software used to generate those environments. I figure it's better to stay away unless this is a fight that one desires to fight.Reviewer #1 (Recommendations for the authors):General comment: There are two measures of performance that are not explored in the manuscript but may aid in describing BonVision's advantages over alternative software. The first is the improved performance and ease of use compared to alternatives in cases where the input used to drive visual stimuli consists of mixtures of asynchronous data sources (e.g. ephys and behavioral measurements together). This is something I imagine BonVision could do with less effort and greater computational efficiency than alternative software. The animal experiments provided are good benchmarks because they are common designs, but do not demonstrate BonVision's immense potential for easily creating visual tasks with complex IO and stimulus contingencies. The second is a measure of human effort required to implement an experiment using BonVision compared to imperative, text-based alternatives. 
I think both of these issues could be tackled in the discussion by expanding a bit on Lines 144-146: why are BonVision and Bonsai so good at data stream composition compared to alternatives, and why is a visual programming approach so appropriate for Bonsai/BonVision's target use cases?

We agree, and we have now revised the Introduction and Discussion to make these points more transparent (particularly around lines 44-55 and 235-239).

General comment: Following up on my desire for a more detailed explanation of the operation of the Bonsai.Shaders library, most of its operators have obvious relations to traditional OpenGL programming operations. However, an explanation of how the traditionally global state machine (context) of OpenGL was mapped onto the Bonsai.Shaders nodes, and how the temporal order of OpenGL context manipulation is enforced, might be helpful for those wishing to understand the underlying mechanics of BonVision and create their own tools using the Shaders library.

We thank the reviewer for prompting us. Generally, we now mention that we build on the Bonsai.Shaders package in new text (lines 58-59 and in Supplementary Details). Specifically, on the point about the temporal order of OpenGL calls, we include the text (in lines 133-135): "BonVision accumulates a list of the commands to OpenGL as the program makes them. To optimise rendering performance, the priority of these commands is ordered according to that defined in the Shaders component of the LoadResources node (which the user can manipulate for high-performance environments). These ordered calls are then executed when the frame is rendered."

Line 11: The use of the word "timing" is ambiguous to me.
Are the authors referring to closed-loop reaction times and determinism, hardware IO delays, the combination of samples from asynchronous data streams, or all of the above?

Thank you for picking this up – the organisation of the first paragraph meant that the subject of this sentence was unclear. We have now tried to make this paragraph clearer, including splitting it into two distinct points (lines 3-17). We hope these changes address the reviewer's point.

Lines 13 and 22: The authors correctly state that graphics programming requires advanced training. However, the use of Bonsai, a functional language that operates entirely on Observable Sequences, also requires quite a lot of training to use effectively. I do think the authors have a point here, and I agree Bonsai is a tool worth learning, but I feel the main strength of using Bonsai is its (broadly defined) performance (speed, elegance when dealing with asynchronous data, ease and formality of experiment sharing, ease of rapid prototyping, etc.) rather than its learning curve. This point is exacerbated by the lack of documentation (outside of the source code) for many Bonsai features.

We agree and have revised the Introduction (lines 42-55) and Discussion (lines 235-239) to make this clearer.

Line 64: Adding a parenthetical link to the Blender website seems appropriate.

Line 97: The model species should be stated here.

Done.

Figure 4(C): There is a single instance of BonVision being outperformed by PsychoPy3 in the case of 16-layer texture blending at 60 FPS. Can the authors comment on why this might be (e.g. PsychoPy3's poor performance at low layer counts is due to some memory bottleneck?)
and why this matters (or does not matter) practically in the context of BonVision's target uses?

In the conditions under which the benchmarking was performed, PsychoPy was able to present more overlapping stimuli than BonVision and PsychToolbox, because PsychoPy presented stimuli at a lower resolution than the other systems. We now indicate this in the main text of the manuscript (lines 150-151).

Figure 4(A-C): The cartoons of display screens have little black boxes in the lower right corners and I'm not sure what they mean.

The black square represents the position of a flickering square, the luminance of which is detected by a photodiode and used to measure frame display times. We have now updated the legend of Figure 4 to make this clear.

Figure 5(A): As mentioned previously, it seems that these are post-hoc temporally aligned receptive fields (RFs). Is it worth seeing what the RFs created without post-hoc photodiode-based alignment of stimulus onset look like, so that we can see the effect of display presentation jitter (or lack thereof)? This would be a nice indication of the utility of the package for real-time stimulus shaping for system-identification purposes, where ground-truth alignment might not be possible. This is made more relevant given BonVision's apparently larger latency jitter compared to PsychToolbox (Figure 4A).

We thank the reviewer for this suggestion. We now include receptive field maps calculated using the BonVision timing log in Figure 5—figure supplement 1. Using the BonVision timing alone was also effective in identifying receptive fields.

Figure 5(D): Although useful, the size discrimination task probably does not cover all potential corner cases with this type of projection. I don't think more experiments need to be performed, but a more thorough theoretical comparison with other methods, e.g.
sphere mapping, might be useful to motivate the choice of cube mapping for rendering 3D objects, perhaps in the discussion.

We now clarify that we use the size discrimination task as a simple test of the ability of BonVision to run VR stimuli on a head-mounted display. Although we considered different mapping styles, we settled on cube mapping for 3D stimuli, as it is currently the standard for 3D rendering systems and the most computationally efficient. We have included a detailed discussion of the merits and issues of the Mercator projection for 2D stimuli in the new section "Appendix 1".

Figure 5(I): The caption refers to the speed of the animal on the ordinate axis, but that figure seems to display a psychometric curve for freezing or running behaviors over time from stimulus presentation.

Thank you for pointing this out; we have now corrected it.

Lines 293 and 319-322: HARP is a small enough project that I feel some explanation of the project's intentions and capabilities, and of the Bonsai library used to acquire data from HARP hardware, might be useful.

We have now added more information on the HARP sources and why we have employed them here, including details of the Bonsai library needed to use the HARP device (lines 648-652). However, we are not core members of the HARP project and are wary of speaking for them on its intentions and other capabilities.

Line 384: "OpenEphys" should be changed to "Open Ephys" in the text and in reference 13, used to cite the Acquisition Board's use.

Done.

Reviewer #2 (Recommendations for the authors):

– Figures 1, 2, 3 and Supp 1 – the indication of the observer in these figures is sometimes a black, and sometimes a red, dot. This is not bad, but I think you could streamline your figures if the first one had a legend for what represents the observer (i.e. observer = red dot), with the same pattern used throughout the figures.

Great suggestion, thank you.
We have now changed all observers to red dots and indicated this in the legend.

– Figure 2 – If I understand correctly, in panels C and E the markers are read by an external camera, which I suppose in this case is the laptop camera? If so, could you please change these panels so that they explicitly show where the cameras are? Maybe adding the first top-left panel from Supp Figure 3 to Figure 2, and indicating from where the markers are read, would solve this.

We think the reviewer has spotted that there are multiple cameras shown in the image, and we apologise for not spotting this ourselves. The calibration is performed using only the images shown (that is, the camera taking the image is the one used for the calibration). We now make this clearer in the legend to Figure 2.

– Figure 5 – Panel I: the legend states "The speed of the animal across different trials, aligned to the time of stimulus appearance." but the figure Y axis states Cumulative probability. I guess the legend needs updating? Also, it is not clear to me how the cumulative probabilities of freeze and flight can sum to more than one, as seems to be the case in the figure. I am assuming that an animal either freezes or flees in this test? Maybe I missed something?

We thank the reviewer for highlighting this error. We have updated the legend.

– In the results, lines 40 to 42, the authors describe how they have managed to have a single framework for both traditional visual presentation and immersive virtual reality. Namely, they project a 2D coordinate frame onto a 3D sphere using the Mercator projection. I would like to ask the authors to explain a bit how they deal with the distortions present in this type of projection. As far as I understand, this type of projection inflates the size of objects that are further away from the sphere midline (with increasing intensity the further away)? Is this compensated for in the framework somehow?
Would it make sense to offer users the option to choose different projections depending on their application?

This is an excellent point. We have added a specific discussion of the Mercator projection in the new section "Appendix 1", where we discuss the distortions and methods to work around them.

– In line 62, "BonVision also has the ability to import standard format 3D design files": could the authors specify which file formats are accepted?

We now link from the main text to the "File Formats" section in Methods.

– When benchmarking BonVision (starting on line 73), the authors focus on 60 Hz stimulus presentation using monitors with different capabilities. This is great, as it addresses the main use cases for human, non-human primate and rodent experiments. I believe, however, that it would be great for the paper and the community in general if the authors could do some benchmarking at higher frame rates and contextualise BonVision for use with other animal models, such as flies, fish, etc. Given that a couple of papers describe visual stimulators that take care of the different wavelengths needed to stimulate the visual systems of these animals, it seems to me that BonVision would be a great tool to create stimuli and environments for these stimulators and animal models.

We have added a new Figure 4—figure supplement 1, in which we show the results of the non-overlapping textures benchmark for BonVision at a 144 Hz refresh rate. Comparison with the same data obtained at 60 Hz shows little deterioration in performance. These new data supplement the existing tests in Figure 4A, where we tested the closed-loop latency at these higher frame rates.

Reviewer #3 (Recommendations for the authors):

I have a few presentation-style points where I feel the text should be more careful not to come across as unintendedly too strong; otherwise, justification needs to be provided to substantiate the claims.
Most importantly, line 19, "the ability for rapid and efficient interfacing with external hardware (needed for experimentation) without development of complex multi-threaded routines", is a bit mysterious to me, because I am unsure what the external hardware are that BonVision facilitates interfacing with. For example, experimenters do prefer multi-threaded routines where the other threads are used to trigger reward delivery, deliver sensory stimuli of other modalities, or control neural stimulation or recording devices. This is in order to avoid blocking execution of the visual display software when these other functions are called. If BonVision provides a solution for these kinds of experiment-interfacing requirements, I think they are definitely important enough to mention in the text. Otherwise, the sentence of line 19 needs some work to make clear exactly which functionalities of BonVision are being referred to.

We agree and have now revised the Introduction (lines 42-55) to make these points clearer.

The other claims that stood out to me are as follows. In the abstract it is said that "Real-time rendering… necessary for next-generation…", but I don't know if anybody can actually claim that any one method is necessary.

We have changed the text to say "important" rather than "necessary".

In line 116, "suggesting habituation to the virtual environment", the authors could also acknowledge that mice might simply be habituating to the rig (e.g. even if there were no visual display), since this does not seem to be a major claim that needs to be made. The virtual cliff effect (line 118) also seems very interesting, but the authors have not fully demonstrated that mice are not instead responding to a change in floor texture. It is also unclear to me why a gray floor (which looks to be equiluminant with the rest of the textured floor, at least judging from Figure 5F) should be visually identified as a cliff, as opposed to, say, black.
In order to make this claim about visual cliff identification, especially without binocular vision, the authors would probably have to show experiments where the mice do not slow down at other floor changes (to white, maybe?), but I'm unsure whether the data exist for this, or whether it is worth the effort. Overall, I don't see a reason why the authors should attempt to claim that "BonVision is capable of eliciting naturalistic behaviors in a virtual environment", since the naturalness of rodent behaviors in virtual environments is a topic of debate in some circles, independent of the software used to generate those environments. I figure it's better to stay away unless this is a fight that one desires to fight.

We agree that there are heated debates around these issues in the field, and that this is not the place to have them. We have changed the relevant sentence to read (lines 204-205): "BonVision is therefore capable of generating virtual reality environments which both elicit, and are responsive to, animal behaviour."
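The Mercator distortion raised by Reviewer #2 can be made concrete with a short worked example. On a sphere, a horizontal arc spanning Δλ degrees of map longitude covers only Δλ·cos φ degrees of visual angle at elevation φ, so a stimulus must be drawn 1/cos φ times wider in the flat 2D frame to preserve its angular size (the familiar inflation of regions far from the equator on Mercator maps). The sketch below is purely illustrative of this geometry; the function names are ours and do not correspond to BonVision's API:

```python
import math

def mercator_scale(elevation_deg: float) -> float:
    # Horizontal inflation factor of the Mercator projection at a given
    # elevation (degrees from the sphere's midline): sec(phi).
    return 1.0 / math.cos(math.radians(elevation_deg))

def angular_extent(map_width_deg: float, elevation_deg: float) -> float:
    # Visual angle actually subtended on the sphere by a feature drawn
    # map_width_deg wide in the flat 2D frame, centred at elevation_deg.
    return map_width_deg * math.cos(math.radians(elevation_deg))

def compensated_width(target_angle_deg: float, elevation_deg: float) -> float:
    # Width to draw in the 2D frame so the feature spans the intended
    # visual angle once wrapped onto the sphere.
    return target_angle_deg * mercator_scale(elevation_deg)

# At the midline there is no distortion; at 60 degrees elevation a
# feature must be drawn twice as wide to keep its angular size.
print(mercator_scale(0.0))            # 1.0
print(compensated_width(10.0, 60.0))  # ~20.0
print(angular_extent(20.0, 60.0))     # ~10.0
```

The same sec φ factor explains why a patch of fixed width in the 2D frame subtends a smaller visual angle the further it sits from the midline, and why any compensation scheme must break down as elevation approaches the poles, where the factor diverges.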