Yu Li1, Fan Xu2, Fa Zhang2, Pingyong Xu3, Mingshu Zhang3, Ming Fan4, Lihua Li4, Xin Gao1, Renmin Han1. 1. King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal, Saudi Arabia. 2. High Performance Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. 3. Key Laboratory of RNA Biology, Institute of Biophysics, Chinese Academy of Sciences, Beijing, China. 4. Institute of Biomedical Engineering and Instrumentation, Hangzhou Dianzi University, Hangzhou, China.
Abstract
Motivation: Super-resolution fluorescence microscopy with a resolution beyond the diffraction limit of light, has become an indispensable tool to directly visualize biological structures in living cells at a nanometer-scale resolution. Despite advances in high-density super-resolution fluorescent techniques, existing methods still have bottlenecks, including extremely long execution time, artificial thinning and thickening of structures, and lack of ability to capture latent structures. Results: Here, we propose a novel deep learning guided Bayesian inference (DLBI) approach, for the time-series analysis of high-density fluorescent images. Our method combines the strength of deep learning and statistical inference, where deep learning captures the underlying distribution of the fluorophores that are consistent with the observed time-series fluorescent images by exploring local features and correlation along time-axis, and statistical inference further refines the ultrastructure extracted by deep learning and endues physical meaning to the final image. In particular, our method contains three main components. The first one is a simulator that takes a high-resolution image as the input, and simulates time-series low-resolution fluorescent images based on experimentally calibrated parameters, which provides supervised training data to the deep learning model. The second one is a multi-scale deep learning module to capture both spatial information in each input low-resolution image as well as temporal information among the time-series images. And the third one is a Bayesian inference module that takes the image from the deep learning module as the initial localization of fluorophores and removes artifacts by statistical inference. Comprehensive experimental results on both real and simulated datasets demonstrate that our method provides more accurate and realistic local patch and large-field reconstruction than the state-of-the-art method, the 3B analysis, while our method is more than two orders of magnitude faster. Availability and implementation: The main program is available at https://github.com/lykaust15/DLBI. Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Super-resolution fluorescence microscopy with a resolution beyond the diffraction limit of light, has become an indispensable tool to directly visualize biological structures in living cells at a nanometer-scale resolution. Despite advances in high-density super-resolution fluorescent techniques, existing methods still have bottlenecks, including extremely long execution time, artificial thinning and thickening of structures, and lack of ability to capture latent structures. Results: Here, we propose a novel deep learning guided Bayesian inference (DLBI) approach, for the time-series analysis of high-density fluorescent images. Our method combines the strength of deep learning and statistical inference, where deep learning captures the underlying distribution of the fluorophores that are consistent with the observed time-series fluorescent images by exploring local features and correlation along time-axis, and statistical inference further refines the ultrastructure extracted by deep learning and endues physical meaning to the final image. In particular, our method contains three main components. The first one is a simulator that takes a high-resolution image as the input, and simulates time-series low-resolution fluorescent images based on experimentally calibrated parameters, which provides supervised training data to the deep learning model. The second one is a multi-scale deep learning module to capture both spatial information in each input low-resolution image as well as temporal information among the time-series images. And the third one is a Bayesian inference module that takes the image from the deep learning module as the initial localization of fluorophores and removes artifacts by statistical inference. Comprehensive experimental results on both real and simulated datasets demonstrate that our method provides more accurate and realistic local patch and large-field reconstruction than the state-of-the-art method, the 3B analysis, while our method is more than two orders of magnitude faster. Availability and implementation: The main program is available at https://github.com/lykaust15/DLBI. Supplementary information: Supplementary data are available at Bioinformatics online.
Fluorescence microscopy with a resolution beyond the diffraction limit of light (i.e. super-resolution) has played an important role in biological sciences. The application of super-resolution fluorescence microscope techniques to living-cell imaging promises dynamic information on complex biological structures with nanometer-scale resolution.Recent development of fluorescence microscopy takes advantages of both the development of optical theories and computational methods. Living cell stimulated emission depletion (STED) (Hein ), reversible saturable optical linear fluorescence transitions (RESOLFT) (Schwentker ) and structured illumination microscopy (SIM) (Gustafsson, 2005) mainly focus on the innovation of instruments, which requires sophisticated, expensive optical setups and specialized expertise for accurate optical alignment. The time-series analysis based on localization microscopy techniques, such as photoactivatable localization microscopy (PALM) (Hess ) and stochastic optical reconstruction microscopy (STORM) (Rust ), is mainly based on the computational methods, which build a super-resolution image from the localized positions of single molecules in a large number of images. Though compared with STED, RESOLFT and SIM, PALM and STORM do not need specialized microscopes, the localization techniques of PALM and STORM require the fluorescence emission from individual fluorophores to not overlap with each other, leading to long imaging time and increased damage to live samples (Lippincott-Schwartz and Manley, 2009). More recent methods (Holden ; Huang ; Quan ; Zhu ) alleviate the long exposure problem by developing multiple-fluorophore fitting techniques to allow relatively dense fluorescent data, but still do not solve the problem completely.Bayesian-based time-series analysis of high-density fluorescent images (Cox ; Xu , 2017) further pushes the limit. By using data from overlapping fluorophores as well as information from blinking and bleaching events, it extends the super-resolution imaging to the large-field imaging of living cells. Despite its potential to resolve ultrastructures and fast cellular dynamics in living cells, several bottlenecks still remain. The state-of-the-art methods, such as Bayesian analysis of the blinking and bleaching (i.e. the 3B analysis) (Cox ), are computationally expensive, and may cause artificial thinning and thickening of structures due to local sampling. Significant improvements on runtime and accuracy have been achieved by single molecule-guided Bayesian localization microscopy (SIMBA) (Xu ) with the introduction of dual-channel fluorescent imaging and single molecule-guided Bayesian inference. However, the enhanced process is severely limited by the specialized class of proteins.Deep learning has accomplished great success in various fields, including super-resolution imaging (Kim ; Ledig ; Lim ). Among different deep learning architectures, the generative adversarial network (GAN) (Goodfellow ) achieved the state-of-the-art performance on single image super-resolution (SISR) (Ledig ). However, there are two fundamental differences between the SISR and super-resolution fluorescence microscopy. First, the input of SISR is a downsampled (i.e. low-resolution) image of a static high-resolution image and the expected output is the original image, whereas the input of super-resolution fluorescence microscopy is a time-series of low-resolution fluorescent images and the output is the high-resolution image containing estimated locations of the fluorophores (i.e. the reconstructed structure). Second, the nature of SISR ensures that there are readily a huge amount of existing data to train deep learning models, whereas for fluorescence microscopy, there are only limited time-series datasets. Furthermore, most of these datasets do not have the ground-truth high-resolution images, which make supervised deep learning infeasible.In this article, we propose a novel deep learning guided Bayesian inference (DLBI) framework, for structure reconstruction of high-resolution fluorescent microcopy. Our framework combines the strength of stochastic simulation, deep learning and statistical inference. To our knowledge, this is the first deep learning-based super-resolution fluorescent microscopy method. In particular, the stochastic simulation module simulates time-series low-resolution images from high-resolution images based on experimentally calibrated parameters of fluorophores and stochastic modeling, which provides supervised training data for deep learning models. The deep learning module takes the simulated time-series low-resolution images as inputs, captures the underlying distribution that generates the ground-truth super-resolution images by exploring local features and correlation along time-axis of the low-resolution images, and outputs a predicted high-resolution image. To achieve this goal, we develop a GAN in which a generator network and a discriminator network contest with each other. The generator network tries to learn the distribution of the high-resolution images in a multi-scale manner, whereas the discriminator network tries to discriminate the ground-truth images and the images produced by the generator network. In order to capture the deep features in the images, we further ease the degradation issue by integrating residual networks (He ) into our GAN model, where degradation means that stacking more network layers does not lead to better accuracy. The high-resolution image produced by the deep learning module is often very close to the ground-truth image. However, it can still contain some artifacts, and more importantly, lacks the physical meaning. Thus, we develop the Bayesian inference module to take the predicted high-resolution image from deep learning, run Bayesian inference from the initial locations of fluorophores in the predicted image, and predict a more accurate high-resolution image.Comprehensive experimental results on two simulated and three real-world datasets demonstrate that DLBI provides more accurate and realistic local-patch as well as large-field reconstruction than the state-of-the-art method, the 3B analysis (Cox ). Meanwhile, our method is more than 100 times faster than the 3B analysis.
2 Materials and methods
As shown in Figure 1, DLBI contains three modules: (i) stochastic simulation (Section 2.1), (ii) deep neural networks (Section 2.2) and (iii) Bayesian inference (Section 2.3).
Fig. 1.
The overall workflow of DLBI. The three modules are shown in solid boxes: the simulation module (green), the deep learning module (red) and the Bayesian inference module (blue). The training (orange) and testing (purple) procedures are shown in dashed boxes. For training, time-series low-resolution images simulated from the simulation module are used to train the deep learning module in a multi-scale manner, from 2×, 4×, to 8× resolution. For testing, given certain time-series low-resolution images, the deep learning module predicts the 2× to 8× super-resolution images, and the Bayesian inference module takes the predicted 8× image (which contains artifacts) and produces the final high-resolution image
The overall workflow of DLBI. The three modules are shown in solid boxes: the simulation module (green), the deep learning module (red) and the Bayesian inference module (blue). The training (orange) and testing (purple) procedures are shown in dashed boxes. For training, time-series low-resolution images simulated from the simulation module are used to train the deep learning module in a multi-scale manner, from 2×, 4×, to 8× resolution. For testing, given certain time-series low-resolution images, the deep learning module predicts the 2× to 8× super-resolution images, and the Bayesian inference module takes the predicted 8× image (which contains artifacts) and produces the final high-resolution imageAlthough deep learning has proved its great superiority in various fields, it has not been used for fluorescent microscopy image analysis. One of the possible reasons is the lack of supervised training data, which means the number of time-series low-resolution image datasets is limited and even for the existing datasets, the ground-truth high-resolution images are often unknown. Here, a stochastic simulation based on the experimentally calibrated parameters is designed to solve this issue, without the need of collecting a massive amount of real fluorescent images. This empowers our deep neural networks to effectively learn the latent structures under the low-resolution, high-noise and stochastic fluorescing conditions. The primitive super-resolution images produced by deep neural networks still contain artifacts and lack physical meaning, we finally develop a Bayesian inference module based on the mechanism of fluorophore switching to produce high-confident images.Our method combines the strength of deep learning and statistical inference, where deep learning captures the underlying distribution that generates the training super-resolution images by exploring local features and correlation along time-axis, and statistical inference removes artifacts and refines the ultrastructure extracted by deep learning, and further endues physical meaning to the final image.
2.1 The stochastic simulation module
The input of our simulation module is a high-resolution image that depicts the distribution of the fluorophores and the output is a time-series of low-resolution fluorescent images with different fluorescing states. We refer the readers to Supplementary Section S1 for terminologies in fluorescence microscopy.In our simulation, Laplace-filtered natural images and sketches are used as the ground-truth high-resolution images that contain the fluorophore distribution. If a gray-scale image is given, the depicted shapes are considered as the distribution of fluorophores and each pixel value on the image is considered as the density of fluorophores at the location. We then create a number of simulated fluorophores that are distributed according to the distribution and the densities. For each fluorophore, it switches according to a Markov model, i.e. among states of emitting (activated), not emitting (inactivated), and bleached. The emitting state means that the fluorophore emits photons and a spot according to the point spread function (PSF) is depicted on the canvas. All the spots of the emitting fluorophores thus result in a high-resolution fluorescent image. Applying the Markov model on the initial high-resolution image generates a time-series of high-resolution images. After adding the background to the high-resolution images, they are downsampled to low-resolution images and noise is finally added. Figure 2 summarizes the stochastic simulation procedure.
Fig. 2.
The workflow of stochastic simulation. Firstly, a high-resolution image is inputted as the distribution and density of fluorophores. Then, the emitting of photons is simulated based on the stochastic parameters for each time frame. A random background (DC offset) is added to each image. The images are then downsampled to low-resolution and noise is added, which results in a time-series of low-resolution images
The workflow of stochastic simulation. Firstly, a high-resolution image is inputted as the distribution and density of fluorophores. Then, the emitting of photons is simulated based on the stochastic parameters for each time frame. A random background (DC offset) is added to each image. The images are then downsampled to low-resolution and noise is added, which results in a time-series of low-resolution imagesHere, the success of simulation relies on three factors: (i) the principal of the linear optical system, (ii) experimentally calibrated parameters of fluorophores and (iii) stochastic modeling.
2.1.1 Linear optics
A fluorescence microscope is considered as a linear optical system, in which the superposition principle is valid, i.e. Image (Obj1 + Obj2) = Image (Obj1) + Image (Obj2). The behavior of fluorophores is considered invariant to mutual interaction. Therefore, for high-density fluorescent images, the pixel density can be directly calculated from the light emitted from its surrounding fluorophores.When a fluorophore is activated, an observable spot can be recorded by the sensor, the shape of which is called the PSF. Considering the limitation of sensor capability, the PSF of an isotropic point source is often approximated as a Gaussian function:
where σ is calculated from the fluorophore in the specimen that specifies the width of the PSF, I0 is the peak intensity and is proportional to the photon emission rate and the single-frame acquisition time, (x0, y0) is the location of the fluorophore.While PSF describes the shape, the full width at half maximum (FWHM) describes the distinguishability. It is defined to be the half width of the maximum amplitude of PSF. If PSF is modeled as a Gaussian function, the relationship between FWHM and σ is given byConsidering the probability of linear optics, a high-density fluorescent image is composed by PSFs of the fluorophores.
2.1.2 Calibrated parameters of fluorophores
In most imaging systems, the characteristics of a fluorescent protein can be calibrated by experimental techniques. With all the calibrated parameters, it is not difficult to describe and simulate the fluorescent switching of a specialized protein.The first characteristic of a fluorophore is its switching probability. A fluorophore always transfers among three states, emitting, not emitting and bleached, which can be specified by a Markov model (Fig. 3). If the fluorophore transfers from not emitting to bleached, it will not emit any photon anymore. As linear optics, each fluorophore’s transitions are assumed to be independent.
Fig. 3.
The Markov model describing state transition of a fluorophore
The Markov model describing state transition of a fluorophoreThe second characteristic of a fluorophore is its PSF. When a real-world fluorophore is activated, the emitted photons and its corresponding PSF will not stay unchanged over time. The stochasticity of the PSF and photon strength describes the characteristics of a fluorescent protein. To simulate the fluorescence, we should not ignore these properties. Fortunately, the related parameters can be well-calibrated. The PSF and FWHM of a fluorescent protein can be measured in low molecule density. In an instrument for PALM or STORM, the PSF of the microscope can be measured by acquiring image frames, fitting the fluorescent spots parameter, normalizing and then averaging the aligned single-molecule images. The distribution of FWHM can be obtained from statistical analysis. The principle of linear optics ensures that the parameters measured in single-molecule conditions are also applicable to high-density conditions.In our simulation, a log-normal distribution (Cox ; Zhu ) is used to approximate the experimentally measured single fluorophore photon number distribution. Firstly, a table of fluorophore’s experimentally calibrated FWHM parameters is used to initialize the PSF table in our simulation, according to Equations (1) and (2). Then for each fluorophore recorded in the high-resolution image, the state of the current image frame is calculated according to the transfer table [P1, P2, P3, P4, P5] (Fig. 3) and a random PSF shape is produced if the corresponding fluorophore is at the ‘emitting’ state. This procedure is repeated for each fluorophore, which results in the final fluorescent image.Performance comparison between the 3B analysis and DLBI on the four areas of the two simulated datasets in terms of PSNR and SSIMThe best performance is shown in bold.
2.1.3 Stochastic modeling
The illumination of real-world objects is different at different time. In general, the illumination change of real-world objects can be suppressed by high-pass filtering with a large Gaussian kernel. However, this operation will sharpen the random noise and cannot remove the background or DC offset (DC offset, DC bias or DC component denotes the mean value of a signal. If the mean amplitude is zero, there is no DC offset. For most microscopy, the DC offset can be calibrated but cannot be completely removed.) To make our simulation more realistic, several stochastic factors are introduced. First, for a series of simulated fluorescent images, a background value calculated from the multiplication between a random strength factor and the average image intensity is added to the fluorescent images to simulate the DC offset. For the same time-series, the strength factor remains unchanged but the background strength changes with the image intensity. Second, the high-resolution fluorescent image is downsampled and random Gaussian noise is added to the low-resolution image. Here, the noise is also stochastic for different time-series and close to the noise strength that is measured from the real-world microscopy.The default setting of our simulation takes a 480 × 480 pixel high-resolution image as the input and simulates 200 frames of 60 × 60 pixel (i.e. binned) low-resolution images.
2.2 The deep learning module
We build a deep residual network under the GAN framework (Goodfellow ; Ledig ) to estimate the primitive super-resolution image I (the latent structure features) from time-series of low-resolution fluorescent images . Instead of building just one generative model, our approach builds a pair of models, a generator model, G, which produces the estimation of the underling structure of the training images, and a discriminator model, D, which is trained to distinguish the reconstructed super-resolution image from the ground-truth one. Figure 4 demonstrates the overview of our deep learning framework.
Fig. 4.
Overview of our deep learning framework. There are two main components of the model, the generator network (G) and the discriminator network (D). G is used to convert the time-series noisy, low-resolution images into a noise-free, super-resolution image while D is used to distinguish the ground-truth high-resolution image from the one produced by G. G and D are designed to contest each other, which are trained simultaneously. As the training goes on, both G’s ability to generate better super-resolution images and D’s ability to distinguish the generated images are improved, which results in more and more similar images to the ground-truth from G. During testing, only G is used
Overview of our deep learning framework. There are two main components of the model, the generator network (G) and the discriminator network (D). G is used to convert the time-series noisy, low-resolution images into a noise-free, super-resolution image while D is used to distinguish the ground-truth high-resolution image from the one produced by G. G and D are designed to contest each other, which are trained simultaneously. As the training goes on, both G’s ability to generate better super-resolution images and D’s ability to distinguish the generated images are improved, which results in more and more similar images to the ground-truth from G. During testing, only G is used
2.2.1 Basic concepts
The goal of training a generator neural network is to obtain the optimized parameters, θ, for the generating function, G, with the minimum difference between the output super-resolution image, I, and ground-truth, I:
where is the generated super-resolution image by G for the nth training sample, N is the number of training images, and l is a loss function that will be specified later.For the discriminator network D, D(x) represents the probability of the data being the real high-resolution image rather than from G. When training D, we try to maximize its ability to differentiate ground-truth from the generated image, to force G to learn better details. When training G, we try to minimize , which is the log likelihood of D being able to tell that the image generated by G is not ground-truth. That is, we minimax the following function:
In this way, we force the generator to optimize the generative loss, which is composed of perceptual loss, content loss and adversarial loss (more details of the loss function will be introduced in Section 2.2.3).
2.2.2 Model architecture
Our network is specialized for the analysis of time-series images through: (i) 3D filters in the neural network that take all the image frames into consideration, which extracts the time dependent information naturally, (ii) two specifically designed modules in the generator residual network, i.e. Monte Carlo dropout (Gal and Ghahramani, 2015) and denoise shortcut, to cope with the stochastic switching of fluorophores and random noise and (iii) a novel incremental multi-scale architecture and parameter tuning scheme, which is designed to suppress the error accumulation in large upscaling factor neural networks.Figure 5 illustrates the entire architecture of the generator model. The input is time-series low-resolution images. We first use a convolutional layer with the filter size as seven by seven (Here, the filter size depends on the FWHM of a PSF. Generally, a fluorescence microscope produces low-resolution images with PSF spanning pixels. The specially designed filter size can balance between the computational time and physical meaning.), which is larger than the commonly used filter, to capture meaningful features of the input fluorescence microscope images. The Monte Carlo dropout layer, which dropouts some pixels from the input feature maps during both training and testing, is applied to the output of the first layer to suppress noise. To further alleviate the noise issue, we use another technique, the denoise shortcut. It is similar to the identical shortcut in the residual network block. However, instead of being exactly the same as the input, we set each channel of the input feature map as the average of all the channels. The denoise shortcut is added to the output of the convolutional layer, which is after 16 residual blocks (see Supplementary Section S2), element-wise. After this feature map extraction process, we use a pixel shuffle layer combined with the convolutional layer to increase the dimensionality of the image gradually (upsampling Conv X2 in Fig. 5).
Fig. 5.
Architecture of the generator network, which is composed of a residual network component and a multi-scale upsampling component. The low-resolution images are firstly fed to the residual network to extract information from the original 3D space, during which denoising is performed by the Monte Carlo dropout and denoise shortcut. The extracted feature maps are fed into the multi-scale upsampling component to increase the resolution gradually. We increase the resolution by a factor of two during each upsampling, resulting in three parameter tuning interfaces. Using the three interfaces, we can use all the 2×, 4× and 8× ground-truth images to train the generator network, reducing the artifacts in the final 8× super-resolution images greatly
Architecture of the generator network, which is composed of a residual network component and a multi-scale upsampling component. The low-resolution images are firstly fed to the residual network to extract information from the original 3D space, during which denoising is performed by the Monte Carlo dropout and denoise shortcut. The extracted feature maps are fed into the multi-scale upsampling component to increase the resolution gradually. We increase the resolution by a factor of two during each upsampling, resulting in three parameter tuning interfaces. Using the three interfaces, we can use all the 2×, 4× and 8× ground-truth images to train the generator network, reducing the artifacts in the final 8× super-resolution images greatlyHere, we adopt a novel multi-scale tuning procedure to stabilize the images. As shown in Figure 5, our generator can output and thus calculate the training error of multi-scale super-resolution images, ranging from to , which means that our model has multiple training interfaces for back propagation. Thus during training, we use the high-resolution ground-truth images to tune the model simultaneously to ensure that the dimensionality of the images increases smoothly and gradually without introducing too much fake detail.For the discriminator network, we adopt the traditional convolutional neural network, which contains eight convolutional layers, one residual block and one sigmoid layer (see Supplementary Section S2). The convolutional layers increase the number of channels gradually to 2048 and then decrease it using one by one filters. Those convolutional layers are followed by a residual block, which further increases the model ability of extracting features. Supplementary Section S2 provides detailed descriptions of the generator and discriminator networks.
2.2.3 Model training and testing
GAN is known to be difficult to train (Salimans ). We use the following techniques to obtain stable models. For the generator model, we do not train GAN immediately after initialization. Instead, we pretrain the model. During the pretrain process, we minimize the mean squared error between the super-resolution image and the ground-truth, i.e. with the pixel-wise MSE loss as
where W is the width of the low-resolution image, H is the height of the low-resolution image, and is the upscaling factor. During pretraining, we optimize simultaneously, instead of optimizing the sum of them.Only after the model has been well-pretrained do we start training the GAN. During that process, we also use VGG19 (Simonyan and Zisserman, 2014) to calculate the perceptual loss (Johnson ) and use Adam optimizer (Kingma and Ba, 2014) with learning rate decay as the optimizer. When feeding an image to the VGG model, we resize the image to fulfill the dimensionality requirement:
where V is the dimensionality of the VGG embedding output.During final tuning, we simultaneously optimize the , and upscaling by the generative loss:
and
where and the upscaling has an additional term, the adversarial loss . For the discriminator network, we use the following loss function:During testing, for the same input time-series images, we run the model multiple times to get a series of super-resolution images. Because of the Monte Carlo dropout layer in the generator model, all of the super-resolution images are not identical. We then compute the average of these images as the final prediction, with another map showing the P-value of each pixel. We use Tensorflow combined with TensorLayer (Dong ) to implement the deep learning module. Trained on a workstation with one Pascal Titan X, the model gets converged in around 8 h.
2.3 The Bayesian inference module
Our Bayesian inference module takes both the time-series low-resolution images and the primitive super-resolution image produced by the deep learning module as inputs, and generates a set of optimized fluorophore locations, which are further interpreted as a high-confident super-resolution image. Since the deep learning module has already depicted the ultrastructures in the image, we use these structures as the initialization of the fluorophore locations, re-sampling with a random punishment against artifacts. For each pixel, we re-sample the fluorophore intensity by and the location by , where is the pixel value in the image produced by deep learning, rand(x, y) is limited in ±8. In this way, the extremely high illumination can be suppressed and fake structures will be re-estimated.
2.3.1 Basic concepts
As shown in Figure 3, a fluorophore has three states: emitting (light), not emitting and bleached. In classic Bayesian-based time-series analysis, the switching procedure of fluorophores is modeled by Bayesian inference, i.e. given an observed region R, deciding whether there is a fluorophore (F) or not (N) by
where P(F) and P(N) are constants which are based on experimental prior, is the probability of the observed data region R given the location of the fluorophore, is the probability of the observed data region R if there is no fluorophore, which can be calculated by integrating all the probability of observing pixels given the noise model.For a single fluorophore, the switching procedure can be modeled by a hidden Markov model (HMM) (Rabiner, 1989), as shown in Figure 6A. However, for high-density fluorophores, each fluorophore transfers the state independently with a stable probability (Cox ) and all the fluorophores together can be modeled by a factorial hidden Markov model (FHMM) (Ghahramani and Jordan, 1996), as shown in Figure 6B, which has been used and proved in (Xu ).
Fig. 6.
The Bayesian inference model used for fluorophore switching. (A) For one fluorophore, its transition between different states can be modeled by a HMM. (B) For high-density fluorophores, the observed fluorescence and their underlying transitions can be modeled by a FHMM
The Bayesian inference model used for fluorophore switching. (A) For one fluorophore, its transition between different states can be modeled by a HMM. (B) For high-density fluorophores, the observed fluorescence and their underlying transitions can be modeled by a FHMM
2.3.2 Refining results with physical meaning
Although HMM and FHMM are capable of modeling the fluorophore switching process, they are localization-guided, which often ignore the global information, and are computationally expensive to learn. Thus, we initialize the fluorophores’ locations by using the image generated by deep learning and use Bayesian inference to further refine the results.We apply the FHMM model to deal with high-density fluorescent microscopy. The parameters of FHMM are estimated by the expectation-maximization algorithm:
where the observation sequence has T frames, . The hidden states are , where each fluorophore has three possible states in the model. Q is a function of the fluorophore parameters given the current parameter estimation and the observation sequence . The procedure iterates between a step that fixes the current parameters and computes posterior probabilities over the hidden states (the E-step), and a step that uses these probabilities to maximize the expected log likelihood of the observations as a function of the parameters (the M-step).In the E-step, we fix the fluorophore parameters in the model and utilize the hybrid of Markov chain Monte Carlo and forward algorithm to sample the initial model. When a new fluorophore is determined, we take samples of this fluorophore using the forward filtering backward sampling algorithm (Godsill ). Thus, the sampled image sequence contains this fluorophore. In the M-step, we optimize the fluorophore parameters and find the maximum a posteriori (MAP) fluorophore positions using the conjugate gradient. Then, based on already known positions of fluorophores, the surrounding fluorophores with high probability are expended. The final super-resolution image is obtained by iterating these two steps until convergence.The detailed method description and parameter setting are given in Supplementary Section S3.
3 Experimental results
3.1 Training deep learning
To train our deep learning module, the stochastic simulation module was used to simulate time-series low-resolution images from 12 000 gray-scale high-resolution images. These images were downloaded from two databases: (i) 4000 natural images were downloaded from ILSVRC (Russakovsky ) and Laplace filtered and (ii) 8000 sketches were downloaded from the Sketchy Database (Sangkloy ). Note that our simulation is a generic method, which does not depend on the type of the input images. Thus any gray-scale image can be interpreted as the fluorophore distribution and used to generate the corresponding time-series low-resolution images.To initialize all the weights of the deep learning models, we used the random normal initializer with the mean as 0 and standard deviation as 0.02. As for the Monte Carlo dropout layer, we set the keep ratio as 0.8. In terms of the Adam optimizer (Kingma and Ba, 2014), we followed the setting in (Dai ; Li ) and set the learning rate as and the , which is the exponential decay rate for the first moment estimates, as 0.9. During training, we set the batch size as 8, the initialization training epoch as 2 and the GAN training epoch as 40. When performing the real GAN training, we utilized the learning rate decay technique, reducing the learning rate by half every 10 epochs.
3.2 Evaluation datasets
Two simulated datasets and three real-world datasets are used to evaluate the performance of the proposed method. Simulated datasets are used due to the availability of ground-truth.The first two datasets are simulated datasets, for which the ground-truth (i.e. high-resolution images) is downloaded from the Single-Molecule Localization Microscopy (SMLM) challenge (http://bigwww.epfl.ch/smlm/datasets/) (Sage ). The two datasets correspond to two structures: MT0.N1.HD (MT) and Tubulin ConjAL647 (Tub). For each structure, single molecule positions were downloaded and then transformed to fluorophore densities according to Section 2.1. For simulation, the photo-convertible fluorescent protein (PCFP) mEos3.2 (Zhang ) and its associated PSF, FWHM and state transfer table were used. For the convenience of calculation, we cropped the large-field structure into four separate areas, each with 480 × 480 pixels (). For each high-resolution image, 200 frames of low-resolution fluorescent images were generated, each with 60 × 60 pixels.The third dataset is a real-world dataset, which was used in recent work (Xu ). The actin was labeled with mEos3.2 (For the convenience of cellular labeling and instrument setup, here all the experiments were carried out by mEos3.2.) in U2OS cells (Actin1) and taken with an exposure time of 50 ms per image frame. The actin network is highly dynamic and exhibits different subtype structures criss-crossing at various distances and angles, including stress fibers and bundles with different sizes and diameters. The dataset has 200 frames of high-density fluorescent images, each with 249 × 395 pixels () in the green channel. This is a good benchmark set that has been well tested which can compare our method with SIMBA (Xu ), a recent Bayesian approach based on dual-channel imaging and PCFPs.Two other real-world datasets labeled with mEos3.2 were also used. One is an actin cytoskeleton network (Actin2), which is labeled and taken under a similar exposure condition with Actin1, but is completely new and has not been used by previous works. The other one is an Endoplasmic reticulum structure (ER), which has a more complex structure. It is a type of organelle that forms an interconnected network of flattened, membrane-enclosed sacs or tubes known as cisternae, which exhibits different circular-structures and connections at different scales. For the ER dataset, the exposure time is 6.7 ms per frame. The resolution of each image in Actin2 is 263 × 337 pixels () and that in ER is 256 × 170 pixels (). Both datasets have 200 frames of high-density fluorescent images and the same photographing parameters as Actin1. These datasets were used to demonstrate the power of our method in diverse ultrastructures. The detailed procedure for collecting the real-world datasets is given in Supplementary Section S4.Since the 3B analysis (Cox ) is one of the most widely used high-density fluorescent super-resolution techniques, which can deal with high temporal and spatial resolutions (Cox ; Lidke, 2012), it was chosen to compare with our method.
3.3 Performance on simulated datasets
3.3.1 Visual performance
Figure 7 shows the visualization of the ground-truth high-resolution images, representative low-resolution input images, the reconstruction results of the 3B analysis, and the results of our method on the simulated datasets. Due to the space limitation, we illustrate three representative areas of each dataset and leave the fourth in Supplementary Section S5.
Fig. 7.
Visualization of the ground-truth high-resolution images, representative low-resolution input images, the reconstruction results of the 3B analysis (Cox ), and the results of our method on three representative areas of each simulated dataset: MT (columns 1–3) and Tub (columns 4–6). The four rows show the ground-truth high-resolution images, the first frames of the simulated time-series low-resolution images, the reconstruction results of the 3B analysis and the reconstruction results of DLBI, respectively
Visualization of the ground-truth high-resolution images, representative low-resolution input images, the reconstruction results of the 3B analysis (Cox ), and the results of our method on three representative areas of each simulated dataset: MT (columns 1–3) and Tub (columns 4–6). The four rows show the ground-truth high-resolution images, the first frames of the simulated time-series low-resolution images, the reconstruction results of the 3B analysis and the reconstruction results of DLBI, respectivelyAs shown in Figure 7, the ground-truth images have very clear structures while the low-resolution image frames are very blurry and noisy ( downsampled). To reconstruct the ultrastructures, we ran the 3B analysis with 240 iterations and ran our Bayesian inference module after the deep learning module with 60 iterations. In each iteration, the Bayesian inference module of our method searches four neighbor points for each fluorophore, whereas the 3B analysis takes isolated estimation strategy. Thus the difference in iteration numbers is comparable. Due to the high computational expense of the 3B analysis, each 60 × 60 image was subdivided into nine overlapped subareas for multi-core process, whereas for our method, the entire image was processed by a single CPU core.It is clear that the reconstructions of our method are very similar to the ground-truth in terms of smoothness, continuity, and thickness. On the other hand, the reconstructions of the 3B analysis consist of a number of interrupted short lines and points with thin structures. In general, two conclusions can be drawn from the visual inspection.First, DLBI discovered much more natural structures than the 3B analysis. For example, in the bottom part of Figure 7B, there are two lines overlapping with each other and a bifurcation at the tail. Due to the very low resolution in the input time-series images (e.g. Fig. 7H), neither DLBI nor the 3B analysis was able to recover the overlapping structure. However, DLBI reconstructed the proper thickness of that structure (Fig. 7T), whereas the 3B analysis only recovered a very thin line structure (Fig. 7N). Moreover, the bifurcation structure was reconstructed naturally by DLBI. Similar conclusions can be drawn on the more complex structures in the Tub dataset (columns 4–6 in Fig. 7).Second, DLBI discovered much more latent structures than the 3B analysis. The Tub dataset consists of a lot of lines (tubulins) with diverse curvature degrees (Fig. 7D–F). The reconstructions of the 3B analysis successfully revealed most of the tubulin structures but left the crossing parts interrupted (Fig. 7P–R). As a comparison, the reconstruction results of DLBI recovered both the line-like tubulin structures and most of the crossing parts accurately (Fig. 7V–X).
3.3.2 Quantitative performance
For single-molecule super-resolution fluorescence microscopy, the quantitative performance has been measured by assessing the localization accuracy of single-emitters in each frame (Huang ; Ram ; Small, 2009). For high-density super-resolution fluorescence microscopy, the entire time-series are analyzed and the production is the probability map of the locations of fluorophores.Since the ground-truth is known for the simulated datasets, here we use peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) to measure the reconstruction performance, both of which are widely-used criteria for image reconstruction in computer vision. The performance of the 3B analysis and DLBI on the two simulated datasets are given in Table 1. Here, we denote the four areas (Section 3.2) of each dataset as ‘01’, ‘02’, ‘03’ and ‘04’, respectively. It can be seen that DLBI clearly outperforms the 3B analysis in terms of both PSNR and SSIM on all the areas of the two datasets.
Table 1.
Performance comparison between the 3B analysis and DLBI on the four areas of the two simulated datasets in terms of PSNR and SSIM
Datasets
MT0.N1.HD
Tubulin ConjAL647
01
02
03
04
01
02
03
04
PSNR (dB)
3B
17.99
17.62
17.84
17.89
13.42
15.49
15.00
13.21
DLBI
18.59
19.16
18.51
20.42
18.72
19.17
18.72
16.63
SSIM
3B
0.89
0.89
0.90
0.90
0.74
0.81
0.75
0.69
DLBI
0.92
0.92
0.93
0.94
0.82
0.85
0.80
0.76
The best performance is shown in bold.
3.4 Performance on real datasets
Figure 8 shows the first frame of the time-series fluorescent images for each of the three real-world datasets. Here, we evaluate the performance of our method for both local-patch reconstruction (areas selected by green rectangles) and large-field reconstruction (areas selected by yellow rectangles).
Fig. 8.
The first frame of the time-series fluorescent images for each of the three real-world datasets: (A) Actin1 (Xu ), (B) Actin2 and (C) ER
The first frame of the time-series fluorescent images for each of the three real-world datasets: (A) Actin1 (Xu ), (B) Actin2 and (C) ER
3.4.1 Local-patch reconstruction
Figure 9 shows the first frames of the low-resolution images of the three local-patches, and the reconstruction results of the 3B analysis and DLBI. The regions of interests of the selected patches are and for the three datasets, respectively (Region of interest is usually denoted as , where (X, Y) are the coordinates of the top left point of the rectangle, W is the width of the rectangle, and H is the height of the rectangle.). The temporal resolutions for the two actin datasets and the ER dataset were 10 s and 1.34 s, respectively, according to the exposure time of the image frames.
Fig. 9.
Reconstructions of the local patches of the three real datasets. First column: the Actin1 dataset (the green box in Fig. 8A). Second column: the Actin2 dataset (the green box in Fig. 8B). Third column: the ER dataset (the green box in Fig. 8C). The first row shows the first frames of the time-series low-resolution images. The second row shows the reconstructions of the 3B analysis. The third row shows the reconstructions of DLBI. The reconstructed images are 480 × 480 pixels and the local-patch images are 60 × 60 pixels
Reconstructions of the local patches of the three real datasets. First column: the Actin1 dataset (the green box in Fig. 8A). Second column: the Actin2 dataset (the green box in Fig. 8B). Third column: the ER dataset (the green box in Fig. 8C). The first row shows the first frames of the time-series low-resolution images. The second row shows the reconstructions of the 3B analysis. The third row shows the reconstructions of DLBI. The reconstructed images are 480 × 480 pixels and the local-patch images are 60 × 60 pixelsIt can be seen that the reconstruction results of the 3B analysis capture the main structures in the fluorescent images, but mainly consist of isolated high-illuminating spots, with details being interrupted (Fig. 9D–F). In contrast, the results of DLBI recover most of the latent ultrastructures, and the reconstructed structures have well-estimated fluorophore distribution and continuous depiction (Fig. 9G–I).We further assessed the reconstruction quality of the 3B analysis and DLBI by SQUIRREL (super-resolution quantitative image rating and reporting of error locations) (Culley ). SQUIRREL compares the diffraction-limited image (the reference image) and the reconstructed equivalents to generate a quantitative map, in which two scores are calculated: the resolution-scaled Pearson coefficient (RSP) and the resolution-scaled error (RSE). The higher RSP and lower RSE values, the higher the image quality is. Table 2 shows the RSP and RSE scores for the 3B analysis and DLBI. It is clear that DLBI significantly outperforms the 3B analysis. More detailed comparisons are given in Supplementary Section S5.3.
Table 2.
Performance comparison between the 3B analysis and DLBI on the real datasets in terms of RSP and RSE with SQUIRREL
Dataset
Actin1-patch
Actin2-patch
ER-patch
Criteria
RSP
RSE
RSP
RSE
RSP
RSE
3B
0.583
3915.096
0.770
2196.068
0.827
4077.037
DLBI
0.721
3326.007
0.878
1648.919
0.916
2904.707
The higher RSP and lower RSE values, the higher the image quality is. The best performance is shown in bold.
Performance comparison between the 3B analysis and DLBI on the real datasets in terms of RSP and RSE with SQUIRRELThe higher RSP and lower RSE values, the higher the image quality is. The best performance is shown in bold.
3.4.2 From deep learning to Bayesian inference
Our method combines the strength of deep learning and statistical inference, where deep learning captures both local features in the images and the time-course correlation, and statistical inference removes artifacts from deep learning and enhances physical meaning to the final results. Conceptually, this is equivalent to using the power of deep learning to automatically and systematically explore and extract spatial and temporal features, and taking advantages of the explicit and rigorous mathematical foundation of probabilistic graphical models. Here, we investigate the effectiveness of this combination.Figure 10 demonstrates the outputs of the deep learning module and the Bayesian inference module. It can be seen that the super-resolution images outputted from the deep learning module are very close to the final images from the Bayesian inference module, except for some artifacts and false structures. This is due to two reasons: (i) the abundance of training data provided by our simulation module, which are simulated under the real experimentally-calibrated parameters, enable deep learning to effectively learn spatial and temporal features and (ii) the high diversity of biological structures is still a challenge, which causes the artifacts and false structures to be learned by deep learning.
Fig. 10.
Reconstructions of the three local patches of the three real datasets (the first column) by the deep learning module (the second column) and by deep learning guided Bayesian inference (the third column)
Reconstructions of the three local patches of the three real datasets (the first column) by the deep learning module (the second column) and by deep learning guided Bayesian inference (the third column)After the deep learning module generates the super-resolution image, the Bayesian inference module uses both the original time-series low-resolution images and the deep learning image to statistically infer a ‘false/true’ determination on each fluorophore location and produce the final image. In particular, the false structures are not directly rejected but used as seeds to search for true structures. Therefore, as shown in Figure 10, although the deep learning module outputted some unnatural structures for the Actin2 and ER datasets, these structures were further corrected by the Bayesian inference module.
3.4.3 Runtime analysis
After being trained, running the deep learning model is very computationally inexpensive. Furthermore, the results of deep learning provide a close-to-optimal initialization for Bayesian inference, which also significantly reduces trial-and-error and leads to faster convergence. Figure 11 shows the runtime comparison of the deep learning module, the entire DLBI pipeline, and the 3B analysis on the nine reconstruction tasks (i.e. the six areas of the simulated datasets shown in Fig. 7 and the three local patches of the real datasets shown in Fig. 9). It can be seen that the runtime for the deep learning module ranges between 1–3 min and that of DLBI ranges between 30–40 min. In contrast, the runtime for the 3B analysis is around 75 h, which is more than 110 times higher than that for DLBI. Our results have demonstrated that the super-resolution images from the deep learning module alone is a good estimation to the ground-truth. Therefore, for users who value time and can compromise accuracy, the results from the deep learning module provide a good tradeoff, and thus a good estimation of the ground-truth.
Fig. 11.
Runtime comparison of the deep learning module (DNN), the entire DLBI pipeline (DLBI) and the 3B analysis (3B) on the nine reconstruction tasks (i.e. the six areas of the simulated datasets shown in Fig. 7 and the three local patches of the real datasets shown in Fig. 9). The runtime was measured on a Fedora 25 system with 128 Gb memory and E5-2667v4 (3.2 GHz) CPU
Runtime comparison of the deep learning module (DNN), the entire DLBI pipeline (DLBI) and the 3B analysis (3B) on the nine reconstruction tasks (i.e. the six areas of the simulated datasets shown in Fig. 7 and the three local patches of the real datasets shown in Fig. 9). The runtime was measured on a Fedora 25 system with 128 Gb memory and E5-2667v4 (3.2 GHz) CPU
3.4.4 Large-field reconstruction
To analyze a dataset with 200 frames, each with about 200 × 300 pixels, it takes our method about h on a single CPU core. Therefore, our method is able to achieve large-field reconstruction on the yellow areas shown in Figure 8. Figure 12 shows the large-field reconstruction images of the three real datasets. For the Actin1 dataset, the selected area is 200 × 300 pixels and the reconstructed super-resolution image is 1600 × 2400 pixels. For the Actin2 dataset, the selected area is 250 × 240 pixels and the reconstructed image is 2000 × 1920 pixels. And for the ER dataset, the selected area is 200 × 150 pixels and the reconstructed image is 1600 × 1200 pixels.
Fig. 12.
Large-field reconstructions of the three real datasets: (A) Actin1, (B) Actin2 and (C) ER. (D) The single-molecule reconstruction of the Actin1 dataset by PALM based on 20 000 frames. (E) Overlap of the reconstruction images by DLBI (in red) and PALM (in green)
Large-field reconstructions of the three real datasets: (A) Actin1, (B) Actin2 and (C) ER. (D) The single-molecule reconstruction of the Actin1 dataset by PALM based on 20 000 frames. (E) Overlap of the reconstruction images by DLBI (in red) and PALM (in green)As shown in Figure 12A and B, the actin networks in the two datasets have been successfully recovered by DLBI. The thinning and thickening trends of the cytoskeleton have been clearly depicted, as well as the small latent structures, including actin filaments, actin bundles and ruffles. For the endoplasmic reticulum structure (Fig. 12C), the circular-structures and connections of the cytoskeleton have also been accurately reconstructed.For the Actin1 dataset, the single-molecule reconstruction of the red channel is available (Fig. 12D). This reconstruction was produced by PALM (Hess ) using 20 000 frames, whereas the reconstruction image of DLBI (Fig. 12A) used only 200 frames. We further overlayed the image produced by DLBI with that of PALM to check how well they overlap (Fig. 12E). It is clear that the main structures of the two images almost perfectly agree with each other. In addition, our method was able to recover the latent structure on the top-left part, which was not photographed by PALM due to out of range of views in dual-channel photographing. If we carefully check the original low-resolution fluorescent images, we could find that this predicted structure indeed exists, which is consistent with our reconstruction.
4 Discussion and conclusion
In this article, we proposed a deep learning guided Bayesian inference method for structure reconstruction of super-resolution fluorescence microscopy. Our method combines the strength of deep learning and statistical inference. We further overcame the high data requirement bottleneck of deep learning by a novel stochastic simulation module based on the experimentally-calibrated parameters and problem-specific physical models. It should be noted that although our simulation provides close-to-realistic data to train the deep learning module, it still contains bias and unrealistic parts, which will be learned by the deep learning module. Bayesian inference, on the other hand, can correct and refine the ultrastructure learned by deep learning, and thus enhance the physical meaning of the final super-resolution image.We have comprehensively evaluated the quality of the reconstructed super-resolution images. The future work includes evaluating how well the framework can rediscover the parameters used in the simulation module.
Funding
This work was supported by the Kind Abdullah University of Science and Technology (KAUST) Office of Sponsored Research (OSR) under Awards No. FCC/1/1976-04, URF/1/2602-01, URF/1/3007-01, URF/1/3412-01 and URF/1/3450-01, the National Key Reaseach and Development Program of China [2017YFA0504702], the National natural Science Foundation of China [U1611263, U1611261, 61232001, 61472397, 61502455, 61672493].Conflict of Interest: none declared.Click here for additional data file.
Authors: Miriam A Schwentker; Hannes Bock; Michael Hofmann; Stefan Jakobs; Jörg Bewersdorf; Christian Eggeling; Stefan W Hell Journal: Microsc Res Tech Date: 2007-03 Impact factor: 2.769