Qi Guo1, Zhujun Shi2, Yao-Wei Huang1,3, Emma Alexander4, Cheng-Wei Qiu3, Federico Capasso5, Todd Zickler1. 1. John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138. 2. Department of Physics, Harvard University, Cambridge, MA 02138; capasso@seas.harvard.edu zhujunshi@g.harvard.edu. 3. Department of Electrical and Computer Engineering, National University of Singapore, 117580 Singapore. 4. Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720. 5. John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138; capasso@seas.harvard.edu zhujunshi@g.harvard.edu.
Abstract
Jumping spiders (Salticidae) rely on accurate depth perception for predation and navigation. They accomplish depth perception, despite their tiny brains, by using specialized optics. Each principal eye includes a multitiered retina that simultaneously receives multiple images with different amounts of defocus, and from these images, distance is decoded with relatively little computation. We introduce a compact depth sensor that is inspired by the jumping spider. It combines metalens optics, which modifies the phase of incident light at a subwavelength scale, with efficient computations to measure depth from image defocus. Instead of using a multitiered retina to transduce multiple simultaneous images, the sensor uses a metalens to split the light that passes through an aperture and concurrently form 2 differently defocused images at distinct regions of a single planar photosensor. We demonstrate a system that deploys a 3-mm-diameter metalens to measure depth over a 10-cm distance range, using fewer than 700 floating point operations per output pixel. Compared with previous passive depth sensors, our metalens depth sensor is compact, single-shot, and requires a small amount of computation. This integration of nanophotonics and efficient computation brings artificial depth sensing closer to being feasible on millimeter-scale, microwatts platforms such as microrobots and microsensor networks.
Jumping spiders (Salticidae) rely on accurate depth perception for predation and navigation. They accomplish depth perception, despite their tiny brains, by using specialized optics. Each principal eye includes a multitiered retina that simultaneously receives multiple images with different amounts of defocus, and from these images, distance is decoded with relatively little computation. We introduce a compact depth sensor that is inspired by the jumping spider. It combines metalens optics, which modifies the phase of incident light at a subwavelength scale, with efficient computations to measure depth from image defocus. Instead of using a multitiered retina to transduce multiple simultaneous images, the sensor uses a metalens to split the light that passes through an aperture and concurrently form 2 differently defocused images at distinct regions of a single planar photosensor. We demonstrate a system that deploys a 3-mm-diameter metalens to measure depth over a 10-cm distance range, using fewer than 700 floating point operations per output pixel. Compared with previous passive depth sensors, our metalens depth sensor is compact, single-shot, and requires a small amount of computation. This integration of nanophotonics and efficient computation brings artificial depth sensing closer to being feasible on millimeter-scale, microwatts platforms such as microrobots and microsensor networks.
Visual depth sensors combine cameras, computational algorithms, and sometimes light sources to sense the 3-dimensional shapes of surrounding objects and scenes. Lidar systems (1), time-of-flight cameras (2–8), and structured lighting systems (9, 10) are examples of depth sensors that use active light sources, whereas binocular stereo systems (11) and light-field cameras (12–14) are examples that are passive, relying solely on the ambient light that happens to be available. These approaches have found widespread use on autonomous vehicles, drones, mobile phones, and many other platforms. However, they require either active lighting or iterative computation and optimization, and are thus not well suited to low-power platforms, such as mobile sensor networks and robotic insects (15–17), which impose much more severe constraints on size, weight, and power consumption.Alternative methods that utilize optical defocus to measure depth have been demonstrated to potentially greatly reduce the amount of depth computation and require no active lighting (18, 19). These algorithms (18–23) compute depth by comparing 2 differently defocused images of the same scene and produce a depth map, comprising a depth value at each pixel. However, one major challenge with this method is the optics. With conventional optical components, capturing 2 differently defocused images usually requires making physical changes to the optical system, such as reducing or enlarging its aperture (20, 21, 23) or deforming its lens (19). This not only adds significant complexity to the system control, but also fundamentally limits the depth-sensing performance by introducing unwanted delays and motion artifacts. Some previous algorithms use look-up tables (20) or iterative methods (22) in their framework to measure depth. However, these methods are hard to implement in a differentiable manner and rely on exhaustive search, instead of using gradient-based search methods, to determine the required parameters.To address these challenges, we introduce the metalens depth sensor. It is compact, static, single-shot, and requires low computational power. Thanks to the versatile wavefront-shaping capability of metalenses, ultrathin nanophotonic components that can tailor arbitrary optical wavefront at a subwavelength scale, our device can simultaneously capture 2 differently defocused images through the same aperture without having to make physical changes to the optical system. It avoids the artifacts usually incurred by reimaging over time while changing a camera’s optics and can potentially improve the depth sensor’s time resolution. Besides, the image-processing algorithm is completely differentiable, which enables data-driven, gradient-based calibration of the computational parameters compared to the nondifferentiable methods (20, 22).The working principle is inspired by the eyes of jumping spiders (Salticidae), which use defocus to succeed at sensing depth, despite the fact that their brains are about as small as poppy seeds (24). Each of the spider’s principal eyes includes a specialized structure (25) with stacked translucent retinae that simultaneously observe the world with different amounts of optical defocus (Fig. 1). Behavioral experiments have shown that the input from 1 principal eye suffices for a jumping spider to sense depth accurately enough to leap onto prey from distances of several body lengths and that depth perception can be predictably manipulated by changing the ambient light spectrum in a way that distorts the optical defocus (26).
Fig. 1.
Jumping spider and metalens depth sensor. (A) Jumping spiders can sense depth using either 1 of their 2 front-facing principal eyes (highlighted). Unlike the single retina found in human eyes, jumping spiders have multiple retinae that are layered and semitransparent. The layered-retinae structure can simultaneously measure multiple images of the same scene with different amounts of defocus, and behavioral evidence suggests that spiders measure depth using the defocus cues that are available in these images (26). (B) The metalens depth sensor estimates depth by mimicking the jumping spider. It uses a metalens to simultaneously capture 2 images with different defocus, and it uses efficient calculations to produce depth from these images. The jumping spider’s depth perception operates normally under green light (26), and we similarly designed the metalens to operate at a wavelength of 532 nm. We coupled the metalens with a spectral filter to limit the spectral bandwidth and with a rectangular aperture to prevent overlap between the 2 adjacent images. The images depicted on the photosensor were taken from experiments and show 2 fruit flies located at different distances. The corresponding depth map computed by the sensor is shown on the right, with color used to represent object distance. The closer and farther flies are colored red and blue, respectively.
Jumping spider and metalens depth sensor. (A) Jumping spiders can sense depth using either 1 of their 2 front-facing principal eyes (highlighted). Unlike the single retina found in human eyes, jumping spiders have multiple retinae that are layered and semitransparent. The layered-retinae structure can simultaneously measure multiple images of the same scene with different amounts of defocus, and behavioral evidence suggests that spiders measure depth using the defocus cues that are available in these images (26). (B) The metalens depth sensor estimates depth by mimicking the jumping spider. It uses a metalens to simultaneously capture 2 images with different defocus, and it uses efficient calculations to produce depth from these images. The jumping spider’s depth perception operates normally under green light (26), and we similarly designed the metalens to operate at a wavelength of 532 nm. We coupled the metalens with a spectral filter to limit the spectral bandwidth and with a rectangular aperture to prevent overlap between the 2 adjacent images. The images depicted on the photosensor were taken from experiments and show 2 fruit flies located at different distances. The corresponding depth map computed by the sensor is shown on the right, with color used to represent object distance. The closer and farther flies are colored red and blue, respectively.Inspired by the specialized compact optical structure of the jumping spider’s principal eye, we propose to use metasurface technology (27–34) to simultaneously collect a pair of differently defocused images using the 2 halves of a single planar photosensor (Fig. 1). Metasurfaces are ultrathin planar optical components consisting of subwavelength-spaced nanostructures patterned at an interface (27). By engineering the shape of individual nanostructures, one can control the phase, amplitude, and polarization of the transmitted wavefront at subwavelength scales, allowing multiple functions to be multiplexed within a single device. Metasurfaces have enabled a variety of optical devices with capabilities that surpass those of conventional refractive or diffractive elements, ranging from high-performance imaging lenses (metalenses) (29, 35) to novel polarization holograms (36). In our prototype sensor, we encode 2 complementary lens phase profiles with distinct focal lengths and lateral offsets on a shared aperture in a single metalens by spatial multiplexing (28). In this way, 2 differently defocused images can be captured simultaneously side by side on the photosensor in a single shot. We design the metalens focal lengths together using a depth-reconstruction algorithm, so that accurate depth maps can be computed from the 2 simultaneous images with calculations that are spatially localized and few in number—i.e., depth computations for each image pixel involve only a small spatial neighborhood of pixels and require no additional correspondence search after initial calibration.Our prototype produces depth values over a range of 10 cm from single-shot measurements using a millimeter-scale metalens. Calculating depth at each output pixel of the depth map requires fewer than 700 floating point operations (FLOPs) and involves the digitized intensity values in only a 25 25 spatial neighborhood of pixels. This integration of nanophotonics and efficient computation brings artificial depth sensing closer to being feasible on millimeter-scale, microwatts platforms such as microrobots and microsensor networks.
Principle
We model the image formed on a photosensor as the convolution of the camera point spread function (PSF) with the magnified, all-in-focus object pattern as it would be observed with a pinhole camera. The camera PSF is the image captured on the photosensor when the object is a point light source. The width of the PSF depends on the optics and the distance Z between the object and the lens. For an ideal thin-lens camera, depicted in Fig. 2, the PSF width is related to object distance by the thin-lens equation:where is the distance between the lens and the photosensor, is the radius of the entrance pupil, and is the in-focus distance—i.e., the distance for which the PSF width is equal to zero. () On the right side of Eq. , all quantities but the object distance Z are known quantities determined by the optical system. Thus, for a calibrated camera, determining the PSF width is equivalent to measuring object distance .
Fig. 2.
Operating principle. (A) A conventional thin-lens camera, in which the PSF width on the photosensor is determined by the optics and the depth (the object distance) according to the lens equation (Eq. ). is the distance between the lens and the photosensor. is the in-focus distance. is the entrance pupil (lens) radius. The solid black curve next to the photosensor represents a vertical cut of the PSF , which is drawn here with a Gaussian shape. (B) The metalens depth sensor encodes the phase profiles of 2 thin lenses in 1 aperture. The 2 effective lenses have distinct in-focus distances () (red and blue) and off-axis alignments that create 2 adjacent images () with different PSF widths (). The effective image centers are shifted from the optical axis by . The dashed red and blue curves next to the metalens show the transmitted wavefronts. Due to spatial multiplexing, the overall phase profile is highly discontinuous and therefore cannot be easily achieved with conventional (Fresnel) diffractive optical elements. (C) From a pair of input images (), a small set of calculations was used to produce the depth at each pixel across the image, generating a depth map according to Eq. . A confidence map that indicates the precision of the depth prediction at each pixel was computed alongside, according to Eq. . The computation flows from left to right, beginning with the per-pixel mean and difference ; Laplacian of the average image computed by convolving the average image with a discrete Laplacian filter; and convolution with a band-pass filter F to attenuate noise and vignetting. From and , the depth and confidence map and were computed by Eqs. and . Parameters , and were determined by the optics and were precalibrated. To eliminate large errors in the depth map, we thresholded it by showing only pixels with confidence values greater than 0.5.
Operating principle. (A) A conventional thin-lens camera, in which the PSF width on the photosensor is determined by the optics and the depth (the object distance) according to the lens equation (Eq. ). is the distance between the lens and the photosensor. is the in-focus distance. is the entrance pupil (lens) radius. The solid black curve next to the photosensor represents a vertical cut of the PSF , which is drawn here with a Gaussian shape. (B) The metalens depth sensor encodes the phase profiles of 2 thin lenses in 1 aperture. The 2 effective lenses have distinct in-focus distances () (red and blue) and off-axis alignments that create 2 adjacent images () with different PSF widths (). The effective image centers are shifted from the optical axis by . The dashed red and blue curves next to the metalens show the transmitted wavefronts. Due to spatial multiplexing, the overall phase profile is highly discontinuous and therefore cannot be easily achieved with conventional (Fresnel) diffractive optical elements. (C) From a pair of input images (), a small set of calculations was used to produce the depth at each pixel across the image, generating a depth map according to Eq. . A confidence map that indicates the precision of the depth prediction at each pixel was computed alongside, according to Eq. . The computation flows from left to right, beginning with the per-pixel mean and difference ; Laplacian of the average image computed by convolving the average image with a discrete Laplacian filter; and convolution with a band-pass filter F to attenuate noise and vignetting. From and , the depth and confidence map and were computed by Eqs. and . Parameters , and were determined by the optics and were precalibrated. To eliminate large errors in the depth map, we thresholded it by showing only pixels with confidence values greater than 0.5.The PSF width determines the amount of image blur. An object appears sharp when its distance is equal to the in-focus distance because then the PSF width is zero (under ray optics approximation). Conversely, when the object distance deviates from , the PSF width, , is nonzero, and the image is blurry. Recovering the PSF width (and thus depth) from a single blurry image is ill-posed without prior information about the underlying object pattern. However, when a second image of the same scene is captured with a different amount of blur, the width can be determined directly from the contrast change between the 2 images. One way to understand this is to assume that the PSFs can be approximated as Gaussian functions that depend on the PSF width :where is the pixel position on the photosensor. Gaussian functions have the property that partial derivatives with respect to width and location satisfyand because the defocused image is the convolution of the PSF and the all-in-focus object pattern (which does not depend on ), the same relationship between derivatives applies to the captured image:Eq. indicates that (and thus depth Z through Eq. ) can be determined directly from the spatial Laplacian of the image and the differential change of intensity with respect to varying PSF width (18, 19). The latter can be estimated via a finite difference, i.e., , where is the change of image intensity induced by a small, known variation of the PSF width (). According to Eq. , since in general no control can be made over the object distance , the only way to change the PSF width when shooting an object is to vary the parameters of the optical system, such as the sensor distance or the in-focus distance .A jumping spider’s principal eye can use its transparent layered retinae to simultaneously measure the (minimally) 2 images that are required to compute the finite difference (Fig. 1) because these retinae effectively capture images with different sensor distances . In contrast, we design a metalens that generates 2 images (, ) side by side (Fig. 1), with the images being equivalent to images that are captured with different in-focus distances (, ) through the same pupil. We design the in-focus distances so that the difference in blur between the images, , is small and approximately differential. From these 2 images (Fig. 2 , Left), we compute the per-pixel difference and the image Laplacian . The latter is obtained by convolving the averaged image with a Laplacian filter, denoted . To reduce the effects of sensor noise and optical nonidealities like vignetting, and are spatially convolved with a purposefully designed linear filter . To the lowest order, the filter is similar to a Gaussian filter that averages over neighboring pixels (). The filtered results and are shown in Fig. 2 , Center. Finally, we combine Eqs. and to calculate the depth at each pixel :with and being constants that are determined by the optics. The correctness of Eq. follows from the fact that Eq. still holds when values and are replaced by their filtered versions and .In practice, even with filtering, random noise in the captured images (, ) results in errors in the measured depth . The error can be quantified in terms of the SD of the measured depth at each pixel, which can be approximated by the measurable quantitywith constants that are determined by the optics (). This measurable quantity can serve as an indicator of the reliability of the measured depth at each pixel . For convenience, we normalize the values of to the range and define this normalized value as the confidence . A higher confidence value at pixel location indicates a smaller value of and a more accurate depth measurement (). Physically, the confidence characterizes the expected accuracy of the measurement at each pixel : A larger confidence value at a pixel indicates a statistically smaller error in the depth measurement.Since Eq. calculates depth using simple, local calculations, it fails in regions of the images that have uniform intensity and thus no measurable contrast for and . To automatically identify the locations of these failures, we use the confidence score as a criterion, and we report depth only at pixels whose confidence is above a certain threshold. The choice of confidence threshold affects the depth resolution, which we define as the smallest depth difference that can be resolved within a certain confidence range. In this paper, with a confidence threshold of 0.5, we achieve a depth resolution of about of the object distance over the distance range (see Fig. 4).
Fig. 4.
Performance analysis. (A) PSFs corresponding to the 2 images (, ), measured by using green LED point light sources placed at different distances in front of the metalens. A spectral filter was used to limit the light bandwidth (10-nm bandwidth centered at 532 nm). The asymmetry in the PSFs results from chromatic aberration and can be eliminated by using a monochromatic laser source (). (B) Depth measured by the metalens sensor as a function of known object distance. Different colors correspond to different confidence thresholds. The solid curves are the mean of the measured depth over many different object points that are located at the same known distance. The upper and lower boundaries of the shaded regions are the corresponding mean deviations of measured depth . In obtaining both and , only pixels whose confidence values are above the threshold are counted. The mean deviation is thus smaller for larger confidence threshold. The solid black line represents the ideal depth measurements (i.e., those equal to the known distances), and the dashed black lines represent relative differences between the measured depth and the known object distance. Within the distance range of 0.3 to 0.4 m, the measured depth is close to the ideal measurements. The mean deviation over this range is around of the object distances, for a confidence threshold of 0.5. Beyond this range, the measured depth trends toward constant values that do not depend on object distance, as indicated by plateaus on the left and right. At these distances, the captured images , are too blurry to provide useful contrast information for the depth measurement.
The complete sequence of calculations for the depth map and confidence map is depicted in Fig. 2. For visualization, the depth map is thresholded by confidence to show only the depth at pixels where the latter is greater than 0.5.
Metalens Design and Characterization
The metalens is designed to incorporate phase profiles of 2 off-axis lenses with different in-focus distances on a shared aperture. For each off-axis lens, the required phase profile is an offset convex shape determined by in-focus distances , sensor distance , and transverse displacement of the image center (Fig. 2):Here, indicates location on the metalens. The overall phase profile is achieved by spatially interleaving the 2, and , on the metalens at a subwavelength scale. The design specifications are in .The required phase profile can be wrapped to (i.e., modulo ) without changing its functionality. Therefore, the key requirement for precise wavefront shaping is to control locally the phase between 0 and . Here, we use titanium dioxide () nanopillars as the building blocks for local phase control. Physically, the nanopillars function as truncated waveguides and impart a phase shift to the transmitted light. By varying the pillar width, one can tune their effective refractive index, and thus the phase shift. Fig. 3 shows that by changing the pillar width () from 90 to 190 nm, one can achieve 0 to phase coverage, while maintaining a high transmission efficiency. The nanopillars have a uniform height () of 600 nm and can be fabricated with a single-step lithography. The center-to-center distance () between the neighboring nanopillars is 230 nm, smaller than half the operating wavelength. This allows us to spatially interleave different phase profiles at a subwavelength scale, which is essential to eliminating unwanted higher-order diffractions.
Fig. 3.
Metalens design. (A) Transmission efficiency and phase shift as a function of the nanopillar width. A, Inset shows the schematic of the metalens building block: a square titanium dioxide () nanopillar on a glass substrate. Pillar height: = 600 nm. Lattice unit cell size (center-to-center distance between neighboring nanopillars): = 230 nm. By varying the pillar width () from 90 to 190 nm, the phase shift changes from 0 to , and the transmission remains high. (B) Top-view SEM image of the right portion of a fabricated metalens. (Scale bar: 2 m.) (C) Enlarged view of the highlighted region in B, with nanopillars corresponding to the 2 lens-phase profiles marked with red and blue. (Scale bar: 500 nm.) (D) Side-view SEM image of the edge of the metalens showing that the nanopillars have vertical sidewalls. (Scale bar: 200 nm.)
Metalens design. (A) Transmission efficiency and phase shift as a function of the nanopillar width. A, Inset shows the schematic of the metalens building block: a square titanium dioxide () nanopillar on a glass substrate. Pillar height: = 600 nm. Lattice unit cell size (center-to-center distance between neighboring nanopillars): = 230 nm. By varying the pillar width () from 90 to 190 nm, the phase shift changes from 0 to , and the transmission remains high. (B) Top-view SEM image of the right portion of a fabricated metalens. (Scale bar: 2 m.) (C) Enlarged view of the highlighted region in B, with nanopillars corresponding to the 2 lens-phase profiles marked with red and blue. (Scale bar: 500 nm.) (D) Side-view SEM image of the edge of the metalens showing that the nanopillars have vertical sidewalls. (Scale bar: 200 nm.)The metalens is fabricated with a technique demonstrated by Devlin et al. (34). Fig. 3 show the scanning electron microscope (SEM) images of a fabricated sample. The location of the imaged area on the metalens is in . The phase wrapping introduces a discontinuity at locations where the phase profile equals an integer number of —i.e., the “zone” boundaries. This corresponds to an abrupt change of nanopillar arrangement, as shown in Fig. 3 . The phase profiles of 2 off-axis lenses have different zone spacing and orientation, corresponding to the 2 nearly vertical boundaries and the diagonal boundary, respectively (Fig. 3 and ). The spatial multiplexing scheme is illustrated explicitly in Fig. 3, with nanopillars belonging to different focusing profiles highlighted in different colors.
Results
We built a prototype metalens depth sensor by coupling the metalens with off-the-shelf components. The sensor’s current size, including mechanical components such as optical mounts, is 4 4 10 cm, but since the metalens is only 3 mm in diameter, the overall size of the assembled sensor could be reduced substantially with a purpose-built photosensor and housing. We paired a 10-nm bandpass filter with the metalens, which is designed for monochromatic operation at 532 nm. A rectangular aperture was placed in front of the metalens to limit the field of view and prevent the 2 images from overlapping. The blur change between the 2 images can be seen in Fig. 4, which shows PSFs for each of the 2 images [, ] that were measured by using a green light-emitting diode (LED) mated to a 10-m-diameter pinhole and placed at different depths Z along the optical axis. The PSFs are more disc-like than Gaussian, and they are asymmetric due to off-axis chromatic aberration. ( shows that this asymmetry disappears under monochromatic laser illumination.)Performance analysis. (A) PSFs corresponding to the 2 images (, ), measured by using green LED point light sources placed at different distances in front of the metalens. A spectral filter was used to limit the light bandwidth (10-nm bandwidth centered at 532 nm). The asymmetry in the PSFs results from chromatic aberration and can be eliminated by using a monochromatic laser source (). (B) Depth measured by the metalens sensor as a function of known object distance. Different colors correspond to different confidence thresholds. The solid curves are the mean of the measured depth over many different object points that are located at the same known distance. The upper and lower boundaries of the shaded regions are the corresponding mean deviations of measured depth . In obtaining both and , only pixels whose confidence values are above the threshold are counted. The mean deviation is thus smaller for larger confidence threshold. The solid black line represents the ideal depth measurements (i.e., those equal to the known distances), and the dashed black lines represent relative differences between the measured depth and the known object distance. Within the distance range of 0.3 to 0.4 m, the measured depth is close to the ideal measurements. The mean deviation over this range is around of the object distances, for a confidence threshold of 0.5. Beyond this range, the measured depth trends toward constant values that do not depend on object distance, as indicated by plateaus on the left and right. At these distances, the captured images , are too blurry to provide useful contrast information for the depth measurement.To suppress the effects of noise and imaging artifacts in images (, ), and to increase the number of high-confidence pixels in the output maps, we computed 9 separate depth and confidence maps using 9 instances of Eqs. and that have distinct and complementary spatial filters , and then we fused these 9 “channels” into 1. We also designed a calibration procedure that tuned the parameters simultaneously, using back-propagation and gradient descent (). In addition to being user-friendly, this end-to-end calibration has the effect of adapting the computation to the shapes of the metalens PSFs, which differ substantially from Gaussians.To analyze the depth accuracy, we measured the depths of test objects at a series of known distances and compared them with the true object distances. The test objects were textured planes oriented parallel to the lens plane. At each object distance, the mean deviation of depth, , was computed by using pixels that have confidence values greater than a threshold. Fig. 4 shows the measured depth for different confidence thresholds as a function of object distance. For a confidence threshold of 0.5, the measured depth was accurate to within a mean deviation below or around of the true depth, over a range of true object distances between 0.3 and 0.4 m. Beyond this range, the measured depth defaulted to the extreme depth value that the system can predict, as indicated by plateaus on the left and right ends. This indicated that the 2 images were so blurry that there was an insufficient contrast difference between them.Fig. 5 shows depth maps for a variety of scenes. Because it uses a single shot, the metalens depth sensor can measure objects that move, such as the fruit flies and water stream in Fig. 5 . It can also measure the depth of translucent entities, such as the candle flames of Fig. 5, that cannot typically be measured by using active sensors like Lidar and time-of-flight. Fig. 5 shows a slanted plane with printed text, where the blur change between the 2 images is particularly apparent. In general, the sensor reports a larger number of depth measurements near regions with edges and texture, whereas regions with uniform intensity and low contrast are typically discarded as having low confidence values. Note that the blur differences between and are visually apparent in Fig. 5 , but that the system still succeeds in Fig. 5, where the differences in blur are hard to discern.
Fig. 5.
Input images and output depth maps. The sensor produces real-time depth and confidence maps of 400 400 pixels at 100 frames per second. (A and B) It can measure fast-moving objects such as fruit flies (A) and water streams (B) because the 2 images (, ) are captured in a single shot instead of sequentially over time. (B and C) It can also measure translucent structures such as water streams (B) and flames (C) because it relies only on ambient light instead of reflections from a controlled light source. (D) A slanted plane with text expresses the difference in defocus between the 2 images (, ). Color bar is in meters. The images and depth map for scenes A, B, and D were produced by illuminating the scenes with a green LED. Depth maps were thresholded at confidence greater than 0.5, which is the threshold that yields a mean deviation of about of object distance between 0.3 and 0.4 m in Fig. 4. Additional images and videos are available in .
Input images and output depth maps. The sensor produces real-time depth and confidence maps of 400 400 pixels at 100 frames per second. (A and B) It can measure fast-moving objects such as fruit flies (A) and water streams (B) because the 2 images (, ) are captured in a single shot instead of sequentially over time. (B and C) It can also measure translucent structures such as water streams (B) and flames (C) because it relies only on ambient light instead of reflections from a controlled light source. (D) A slanted plane with text expresses the difference in defocus between the 2 images (, ). Color bar is in meters. The images and depth map for scenes A, B, and D were produced by illuminating the scenes with a green LED. Depth maps were thresholded at confidence greater than 0.5, which is the threshold that yields a mean deviation of about of object distance between 0.3 and 0.4 m in Fig. 4. Additional images and videos are available in .For scenes other than Fig. 5, we used green LED light sources, and the overall transmission efficiency of the metalens plus the bandpass filter was around . For sunlight illumination, the bandpass filter transmitted around of the visible light of the solar spectrum. The absolute irradiance that supports the function of the sensor varied based on the sensitivity of the photosensor that was used and can be estimated from specifications including absolute sensitivity threshold, dynamic range, etc. For our experimental setup, the irradiance at the aperture was estimated to be between 0.3 and 0.5 W/m2 within the working bandwidth to support the function of the sensor.The sensor generates depth and confidence maps of 400400 pixels at more than 100 frames per second using a combined central processing unit and graphics processing unit (Intel i5 8500k and NVIDIA TITAN V). It could be accelerated substantially by optimizing the code and/or the hardware because the calculations are spatially localized and few in number. Producing the depth and confidence values at each output pixel required 637 FLOPs and involved only the 25 25 surrounding pixels. For context, an efficient implementation of a binocular stereo algorithm requires about 7,000 FLOPs per output pixel (37), and a system-on-chip implementation of the well-known Lucas–Kanade optical flow algorithm (with spatial dependence similar to that of our sensor) requires over 2,500 FLOPs per pixel (38).
Discussion
The metalens depth sensor inherits some of the limitations that exist in the vision system of jumping spiders, such as a limited spectral bandwidth and a limited field of view. However, these limits are not fundamental, and they can be alleviated by more sophisticated metalens designs. The spectral bandwidth can be expanded by using achromatic metalenses (35, 39, 40), which also improve light efficiency. The field of view can be improved, for example, by using metalens nanopillars that are sensitive to polarization to induce 2 differently focused images that are superimposed on the sensor plane with orthogonal polarizations (36) and then transducing the 2 images with a spatially multiplexed, polarization-sensitive sensor array. This would effectively trade spatial resolution and light efficiency for an increase in field of view.The proposed computational algorithm produces a dense field of depth estimates that are each associated with a confidence value. The confidence is essential for the users of the depth sensor to remove unreliable predictions. It also uses a multiscale filtering approach to handle image textures at different spatial frequency and takes advantage of the confidence to merge all different spatial scales together, compared to previous methods (20, 22) that only use filters at a single, predetermined spatial scale. The proposed algorithm does not incorporate inference-based methods such as Markov random fields (MRFs) or conditional random fields (CRFs) that could exploit longer-range coherence between depth values across the field of view. Instead, the depth and confidence estimations at each pixel are only based on information of its spatial neighborhood. The advantages of this design choice are flexibility and generality. For tasks that require high speeds, the output can be used as-is, with simple thresholding of confidence values. For tasks that require higher accuracy and fewer holes in the depth map, the current output can be fed into an MRF/CRF (or any other spatial regularizer) that is appropriate for that task. Moreover, because the pipeline is end-to-end differentiable, its parameters can be fine-tuned in conjunction with MRF/CRF parameters to optimize performance on the specific task.By combining cutting-edge nanotechnology and computer vision algorithms, this work introduces a passive snapshot depth sensor that mimics some of the capabilities of a jumping spider. The sensor’s small volume, weight, and computation (i.e., power) bring depth-sensing capabilities closer to being feasible on insect-scale platforms, such as microrobots, ingestible devices, far-flung sensor networks, and small wearable devices. Combinations of nanophotonics and efficient computation that are different from the ones in this paper might lead to other forms of compact visual sensors, and this is an area that remains relatively unexplored (41).
Authors: J P Balthasar Mueller; Noah A Rubin; Robert C Devlin; Benedikt Groever; Federico Capasso Journal: Phys Rev Lett Date: 2017-03-14 Impact factor: 9.161
Authors: Tsz Kit Yung; Jiawei Xi; Hong Liang; Kai Ming Lau; Wai Chun Wong; Randy Stefan Tanuwijaya; Fan Zhong; Hui Liu; Wing Yim Tam; Jensen Li Journal: iScience Date: 2022-03-24