Functional ultrasound (fUS) is a rapidly emerging modality that enables whole-brain imaging of neural activity in awake and mobile rodents. To achieve sufficient blood flow sensitivity in the brain microvasculature, fUS relies on long ultrasound data acquisitions at high frame rates, posing high demands on the sampling and processing hardware. Here we develop an image reconstruction method based on deep learning that significantly reduces the amount of data necessary while retaining imaging performance. We trained convolutional neural networks to learn the power Doppler reconstruction function from sparse sequences of ultrasound data with compression factors of up to 95%. High-quality images from in vivo acquisitions in rats were used for training and performance evaluation. We demonstrate that time series of power Doppler images can be reconstructed with sufficient accuracy to detect the small changes in cerebral blood volume (~10%) characteristic of task-evoked cortical activation, even though the network was not formally trained to reconstruct such image series. The proposed platform may facilitate the development of this neuroimaging modality in any setting where dedicated hardware is not available or in clinical scanners.
Functional ultrasound (fUS) is an innovative imaging modality that creates
brain-wide neural activity maps at micrometer and millisecond-scale resolution by
tracking temporal cerebral blood volume (CBV) changes in the brain microvasculature
[1]. Similar to blood oxygen level
dependent functional magnetic resonance imaging (BOLD fMRI), the detected CBV
signals provide an indirect measurement of local spiking activity via neurovascular
coupling [2]. However, fUS yields higher
spatiotemporal resolution than fMRI and uses more affordable and portable equipment,
opening the possibility for functional neuroimaging performed directly at the
bedside [3]–[6]. In addition, fUS is more sensitive than fMRI to
optogenetically driven stimuli and provides a larger effect size, potentially
facilitating scientific investigations in animal models [7].

Preclinically, fUS enables imaging of neural activity in awake and freely
behaving rodents and reduces the confounding factors introduced by
anesthesia/sedation or physical restraint [8],
[9]. Furthermore, fUS has proven useful
for imaging resting state and task-evoked functional connectivity in the rat and
mouse brain [2], [10], [11] and for
mapping neural activation in primates during cognitive tasks and visual stimulation
[12]–[14]. In humans, fUS has been used intraoperatively for
image-monitored brain tumor removal surgeries [4], [5], and in neonates to
visualize epileptic activity and measure functional connectivity through the
anterior fontanel window [3], [6].

To detect hemodynamic changes in the brain microvascular network, fUS relies
on highly sensitive power Doppler sequences based on the use of plane wave
emissions. Unfocused ultrasound waves insonify the entire field of view, and the
received radiofrequency (RF) data from tilted plane waves are re-focused (or
beamformed) and coherently compounded to increase resolution and depth of
penetration. This strategy makes it possible to continuously acquire long sequences
of ultrasound data at high frame rates. The obtained compound Doppler signals are
then processed to filter out the strong, undesired clutter originating from the
tissue, and are squared and time-integrated to create power Doppler images with
pixel amplitude proportional to the CBV (Fig.
1a).
Fig. 1.
a, In the state-of-the-art processing, a power Doppler
image is created from a sequence of 250 compound ultrasound frames. In each
pixel, the temporal signal s_Dopp, sampled in the Doppler time
t_Dopp, is passed through a bank of filters F to remove the tissue
clutter component s_clutter. The retained blood signal
s_blood is squared and time-integrated to compute the power
Doppler pixel value proportional to cerebral blood volume. b, The
Deep-fUS 3D-Res-UNet architecture uses a modified U-Net network consisting of
residual blocks arranged in a 5-layer encoder followed by a decoder. An input
3-D convolutional layer extracts spatiotemporal features from the 3-D input
structure. The input data is an under-sampled compound sequence created by
selecting the first k frames of N × N pixels (selected frames displayed
with a green border). The network outputs N × N power Doppler images.
c, Residual blocks composed of two cascaded Conv/ReLU/Dropout
layers implement a shortcut connection between the input and output.
d, Representative transfer functions of the input 3-D
convolutional filters learned by the network. These were computed by performing
a fast Fourier transform of the filter kernels averaged in the 3 × 3
spatial domain. The cutoff frequencies (−3 dB) for the two filters are 95
Hz (left) and 58 Hz (right).
The length of the acquisition sequence is critical to effectively
discriminate the weak signals scattered by red blood cells circulating in the blood
stream from the strong clutter originating in the surrounding tissue. When long
observation windows are used, efficient clutter filtration can be achieved in both
large and small vessels by using temporal and singular-value decomposition (SVD)
filters [9], [15], [16]. Conversely, clutter
filtration becomes challenging with shorter acquisitions, in particular in the
smaller vessels where the blood-signal-to-clutter ratio is reduced and the
low-frequency Doppler spectral components overlap with the tissue spectrum. As a
result, fUS imaging implementations use hundreds of compound frames (typically 200
to 400) to create a single power Doppler image.

The need to acquire and process large ultrasound datasets poses high demands
on the hardware platform in terms of storage capacity and computational power, with
data throughputs on the order of 250 MSa/image (see detailed calculations in the
Appendix). These requirements make
real-time fUS imaging challenging even in graphics processing unit (GPU)
implementations, and these considerations are yet more relevant for volumetric fUS
sequences [17]–[19]. Therefore, it is highly desirable to achieve
state-of-the-art (SoA) fUS imaging performance with shorter ultrasound acquisitions,
as this may effectively improve access to this imaging modality and expedite its
clinical translation.

Here we propose a deep learning platform to reconstruct power Doppler images
from temporally and spatially sparse compound datasets. We implemented convolutional
neural networks (CNNs) based on an encoder-decoder architecture (U-Net) [20]. Variants of this model have been used for
biomedical image reconstruction in applications spanning compressed sensing MRI
[21], sparse-projection photoacoustic
imaging [22], and sparse X-ray computed
tomography [23]. We modified the U-Net by
(i) adding an input layer of 3-D convolutional filters that
extract spatiotemporal features from the 3-D input data; (ii) using
residual blocks that implement shortcut connections between the input and output
features in each layer; and (iii) using a custom loss function.

Prior CNN applications in medical ultrasound imaging include contrast
improvement [24] and image de-speckling
[25], ultrasound contrast agent
localization and tracking [26], [27], and under-sampled and adaptive beamforming
[28]–[31]. To our knowledge, our platform is the first attempt
to learn a reconstruction mapping between the sparse sequence of compound ultrasound
data and the power Doppler output image, without requiring any prior model-based
information (Fig. 1b, c). We trained the networks on high-quality power Doppler
images from in vivo acquisitions in rats and using a custom loss
function.
Methods
Deep-fUS Networks
We modified a U-Net and trained it to perform the power Doppler
reconstruction task. This fully convolutional neural network is based on an
encoder/decoder architecture. The encoder progressively down-samples the input
data and learns high-level features that are propagated to the next stages. The
decoder uses up-sampling operators to increase the resolution of the encoder
features and to consecutively restore the input resolution at the output stage.
Skip connections between the encoding and decoding paths allow retention of
context information, which is propagated to the symmetric up-sampling
layers.

Software for all the Deep-fUS networks, trained models, and test data
sets are available at https://github.com/todiian/deep-fus.
3D-Res-UNet:
In the 3D-Res-UNet model, we modified the U-Net by adding an input
layer of four 3-D convolutional filters followed by rectified linear unit
(ReLU) activations. This layer extracts spatiotemporal features from the 3-D
input structure. In addition, we replaced convolutional layers with residual
blocks composed of two cascaded Conv/ReLU/Dropout layers and included
shortcut connections between the input and output features. Residual blocks
were arranged in a 5-layer encoder followed by a 4-layer decoder and
implement 3 × 3 convolutions followed by ReLU activations and a
dropout layer to improve regularization. We used 1 × 1 convolutions
at the input of each layer to equalize the number of input and output
features of each residual block. In the encoder path, down-sampling is
performed by a 2 × 2 max pooling operator that halves the resolution
in both the image dimensions. In the decoder, 2 × 2 transposed
convolutions with ReLU activations are used as up-sampling operators. The
number of channels is progressively increased in the encoder (32, 64, 128,
256, and 512 filters) and then decreased in the decoder (256, 128, 64, and
32 filters). The output layer is a single-channel 1 × 1 convolution
block. The stride is equal to 1 in all the convolutional layers and 2 in the
max pooling and transposed convolution blocks. This network has a total of
9,788,421 trainable parameters (Table
II). The size of the filter kernels at the input stage and the
dropout rate were considered as hyperparameters and were optimized via
Bayesian optimization. All the convolutional kernels were initialized using
the He initialization [32].
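For illustration, the following is a minimal Keras sketch of one residual block as described above: two cascaded Conv/ReLU/Dropout stages with a 1 × 1 convolution equalizing the number of features on the shortcut path. The filter count and dropout rate shown are placeholders rather than the optimized values, and the exact operation ordering may differ from the reference code in the repository.

    from tensorflow.keras import layers

    def residual_block(x, n_filters, dropout_rate=0.2):
        # 1 x 1 convolution equalizes the channel count on the shortcut path
        shortcut = layers.Conv2D(n_filters, 1, padding="same")(x)
        # Two cascaded Conv/ReLU/Dropout stages (3 x 3 kernels, stride 1)
        y = layers.Conv2D(n_filters, 3, padding="same", activation="relu",
                          kernel_initializer="he_normal")(x)
        y = layers.Dropout(dropout_rate)(y)
        y = layers.Conv2D(n_filters, 3, padding="same", activation="relu",
                          kernel_initializer="he_normal")(y)
        y = layers.Dropout(dropout_rate)(y)
        # Shortcut connection between the block input and output
        return layers.Add()([shortcut, y])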
3D-UNet, UNet, and PP-UNet:
In addition to the 3D-Res-UNet, we trained and optimized three
networks. The 3D-UNet uses simple convolutional blocks in place of residual
blocks. Specifically, each layer is composed of 2 consecutive 3 × 3
convolution blocks, each followed by ReLU activations and dropout for
network regularization. The output layer is a single-channel 1 × 1
convolution block. The stride is equal to 1 in all the convolutional layers
and 2 in the max pooling and transposed convolution blocks. The size of the
filter kernels in the first layer and the dropout rate were considered as
hyperparameters and were optimized using Bayesian optimization.

The UNet is analogous to the 3D-UNet except for the absence of the
3-D convolutional filters at the input. These two networks were
independently trained and optimized to separately analyze the effects on the
reconstruction performance of the input 3-D convolutional filters and of the
residual shortcut connections. In addition, we trained and optimized a
network with the same characteristics as the above U-Net to perform the
post-processing of power Doppler images that were generated by conventional
processing of sparse compound sequences. We refer to this network as
postprocessing (PP)-UNet.

All the convolutional kernels were initialized using the He
initialization [32].
Datasets
We trained the networks to learn a function y = f(x) that maps the input
sequence x of compound frames of N × N pixels to the output power Doppler
image y of dimensions N × N (Fig. 1b). In all our experiments, we used images of 96
× 96 pixels, and we standardized the input compound datasets. We chose to
base the processing on beamformed data instead of sensor RF data to minimize
data throughput and storage (see Appendix). SoA images were obtained from in vivo
acquisitions of coronal and sagittal slices of the rat brain reconstructed by
state-of-the-art power Doppler processing using 250 complex compound frames. To
improve the network regularization, we performed random cropping when more than
96 pixels were available in any image dimension, and a random horizontal
flipping was applied with a probability of 50%.

The training, validation, and test sets were created by selecting pairs
of compound data and power Doppler images from a total of 58 in
vivo acquisitions of coronal brain slices recorded in n = 15 rats
between 2.7 mm anterior and 7.0 mm posterior to bregma [33]. In addition to the variability given by the
different subjects and anteroposterior position, each pair captured distinct
hemodynamic states and levels of anesthesia. We used 740 pairs from 50 data
acquisitions for training, 40 pairs from 4 acquisitions for validation, and 40
pairs from 4 acquisitions for testing. Furthermore, in n = 1 rat we recorded
data in sagittal view to demonstrate the generalization capabilities of the
method.

We performed under-sampling of the compound sequences in the temporal
domain by selecting the first k frames in each sequence (Fig. 1b). We retained only the real part of
the beamformed data. For the experiments in Fig.
6, we also under-sampled the compound frames in the image domain by
selecting sub-samples of pixels with a ratio m = N′/N, with N′ the number
of retained pixels and N = 96² the total number of image pixels.

We calculated the final compression factor as

    CF = 1 − (m · k) / (2 · 250),

where k is the number of retained compound frames, 250 is the number of
frames in the full sequence, and the factor of 1/2 accounts for the missing
imaginary part. For example, k = 25 with m = 1 gives CF = 1 − 25/500 = 95%,
as does k = 50 with m = 1/2.
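As an illustration, the following is a minimal NumPy sketch of this retrospective under-sampling and of the compression factor calculation. The array shapes are assumptions, and the random spatial mask is purely illustrative; the spatial sampling maps actually used in our experiments are those displayed in Fig. 6.

    import numpy as np

    def make_sparse(compound, k, m=1.0, seed=0):
        """Keep the real part of the first k frames and a fraction m of pixels.

        compound: complex array of shape (250, 96, 96).
        """
        x = np.real(compound[:k])                  # temporal under-sampling
        if m < 1.0:
            # Illustrative random pixel mask; the experiments use the
            # structured sampling maps shown in Fig. 6.
            mask = np.random.default_rng(seed).random(x.shape[1:]) < m
            x = x * mask                           # discarded pixels set to zero
        return x

    def compression_factor(k, m=1.0, n_frames=250):
        # The factor of 1/2 accounts for the discarded imaginary part
        return 1.0 - (m * k) / (2.0 * n_frames)

    print(compression_factor(25))         # 0.95 (k = 25, m = 1)
    print(compression_factor(50, m=0.5))  # 0.95 (k = 50, m = 1/2)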
Training and Hyperparameter Optimization
At each iteration, the networks predict a new estimate ŷ, and the
parameters are learned using the Adam optimizer with default settings
β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁷ [34], [35] to minimize the loss function

    L(y, ŷ) = λ L_ℓ1(y, ŷ) + (1 − λ) L_SSIM(y, ŷ)

with

    L_ℓ1(y, ŷ) = (1 / nN) Σᵢ ∥yᵢ − ŷᵢ∥₁
    L_SSIM(y, ŷ) = (1 / n) Σᵢ [1 − SSIM(yᵢ, ŷᵢ)].

In the above equations, y denotes the SoA training images, ∥ · ∥₁ the
ℓ1 norm, N the number of image pixels, and n the number of examples. The
structural dissimilarity index metric loss L_SSIM is a perceptual loss
based on the structural similarity index metric (SSIM), which integrates
luminance, contrast, and structural information [36]. A kernel of 3 × 3
pixels was used for the SSIM calculation. We considered the learning rate
and the parameter λ as hyperparameters, and their optimal value was
determined via Bayesian optimization.

We based our quantitative performance analysis on the SSIM of the
reconstructed images versus the respective SoA images, the normalized mean
squared error NMSE = ∥y − ŷ∥₂² / ∥y∥₂², with ∥ · ∥₂ the ℓ2 norm, and on
the peak signal-to-noise ratio
API. The networks were trained on a single NVIDIA Titan RTX GPU with 24 GB of
RAM mounted on a Dell Precision Tower workstation (20-core Intel Xeon 3.3 GHz;
32 GB of RAM). The mini-batch size was set to 1 in all the experiments.

For each network, we first optimized the hyperparameters using the
Bayesian optimization routine in the Keras Tuner library. We ran 15 optimization
trials using the sparse dataset with CF 75%. The optimization routine was
instructed to maximize the validation SSIM. Each trial trained the
reconstruction CNNs for 2500 epochs and selected the model with the best
performance. The results of the optimal hyperparameter search for all the
networks are reported in Table I. Then,
we trained the CNNs with the respective optimal hyperparameters using CFs of
80%, 85%, 90%, and 95%. We trained the 3D-Res-UNet and 3D-UNet networks for 1500
epochs (we noted that these CNNs converged more quickly during optimization),
the U-Net for 2500 epochs, and the PP-UNet for 500 epochs. In all trainings, we
saved the model with the best validation SSIM.
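To make the training objective concrete, the following is a minimal TensorFlow sketch of the composite loss defined above, assuming the convex combination of the ℓ1 and structural dissimilarity terms given in the equations; the λ weighting order follows that reconstruction, and the reference implementation is available in the repository.

    import tensorflow as tf

    def deep_fus_loss(lam=0.1, max_val=1.0):
        """Composite l1 + structural dissimilarity loss.

        Assumes y_true and y_pred have shape (batch, height, width, 1)
        and are scaled to [0, max_val].
        """
        def loss(y_true, y_pred):
            # l1 term: mean absolute error over pixels and examples
            l1 = tf.reduce_mean(tf.abs(y_true - y_pred))
            # Structural dissimilarity term; 3 x 3 kernel as in the text
            ssim = tf.image.ssim(y_true, y_pred, max_val=max_val,
                                 filter_size=3)
            l_ssim = tf.reduce_mean(1.0 - ssim)
            return lam * l1 + (1.0 - lam) * l_ssim
        return loss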
TABLE I
Results of Bayesian Optimization

Network     | Hyperparameter | Optimized value
3D-Res-UNet | Conv3D, k_1,2  | 3
            | Conv3D, k_3    | 16
            | Learning rate  | 5.5 × 10⁻⁴
            | Dropout rate   | 0.2
            | Lambda         | 0.1
3D-UNet     | Conv3D, k_1,2  | 1
            | Conv3D, k_3    | 16
            | Learning rate  | 1.1 × 10⁻⁴
            | Dropout rate   | 0.1
            | Lambda         | 0.9
UNet        | Learning rate  | 7.4 × 10⁻⁵
            | Dropout rate   | 0.2
            | Lambda         | 0.8
PP-UNet     | Learning rate  | 7.4 × 10⁻⁴
            | Dropout rate   | 0.1
            | Lambda         | 0.2
Ultrasound System and Data Acquisition
For ultrasound data acquisition, we used two 128-element linear array
transducers (L22–14vX and L22–14vLF; Verasonics Inc.) operating at
a 15-MHz center frequency with a Vantage 256 research scanner (Verasonics Inc.).
The probes are geometrically identical apart from the focus in the elevation
plane; the L22–14vX is focused at a distance of 8 mm, and the
L22–14vLF is focused at 20 mm. For exact positioning relative to the
skull landmarks, the imaging probe was housed in a custom 3-D printed holder
mounted on a motorized positioning system. Ultrasound gel was used for acoustic
coupling. We used tilted plane waves at angles (−6°,
−3°, 0°, 3°, 6°) emitted with a pulse repetition
frequency of 19 kHz. Two consecutively emitted plane waves were averaged for
each angle to increase the signal-to-noise ratio, giving a total of 10 emissions
per compound frame. We acquired data for 250 compound frames at a rate of 1 kHz
(i.e., acquiring a new sequence of compound frames takes 250 ms), and the data
for each compound sequence (250 · 10 emissions) were transferred in batch
to the host computer. Compound frames were created by beamforming the received
sensor RF data in a regular grid of pixels of 100 μm
× 100 μm in an NVIDIA Titan RTX GPU using a GPU
beamformer [37]. Ultrasound data were
acquired asynchronously and continuously, i.e., a new sequence of frames was
acquired during processing of the previous sequence and held in the scanner
buffer until the host computer was available. The compound frames were saved on
the host machine for offline processing. The final power Doppler frame rate was
0.6 frames/s.
Conventional Power Doppler Processing
Sequences of compound ultrasound frames were processed in MATLAB
(MathWorks, Inc.) for clutter filtration and power Doppler computation. We used
a 5th-order temporal high-pass Butterworth filter with a cutoff
frequency of 40 Hz cascaded with an SVD filter that eliminates the first
singular value [9]. In the Doppler space,
frequencies are linearly proportional to the velocity of the scatterers from
which the Doppler signal originated. Therefore, it is expected that signals
emanating from the slowly moving tissue surrounding the blood vessels (clutter)
are positioned at around 0 Hz, and this assumption justifies the use of a
temporal high-pass filter. Singular value decomposition filters aim to eliminate
highly coherent signal components and assume that, while blood signals are
highly incoherent due to the time-varying stochastic distribution of the moving
scatterers (red blood cells), tissue signal maintains a high degree of
correlation over time. At each pixel location (x,
y), the intensity of the filtered signal was then
calculated to find the power Doppler value (Fig. 1a).
For the SoA processing (250 complex compound frames), the entire time window of
250 ms was integrated.
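For reference, the following is a NumPy/SciPy sketch of this conventional pipeline; our implementation is in MATLAB, so the array shapes, the causal filter application, and the normalization are assumptions for illustration.

    import numpy as np
    from scipy.signal import butter, lfilter

    def power_doppler(compound, fs=1000.0, fc=40.0):
        """compound: complex compound frames of shape (n_frames, nz, nx)."""
        n_frames, nz, nx = compound.shape
        # 5th-order Butterworth high-pass along the Doppler (frame) axis
        b, a = butter(5, fc / (fs / 2.0), btype="high")
        s = lfilter(b, a, compound, axis=0)
        # SVD clutter filter: zero out the most coherent (first) component
        casorati = s.reshape(n_frames, nz * nx)
        u, sv, vh = np.linalg.svd(casorati, full_matrices=False)
        sv[0] = 0.0
        s_blood = (u * sv) @ vh
        # Power Doppler: time-integrated squared magnitude in each pixel,
        # proportional to cerebral blood volume
        return np.mean(np.abs(s_blood) ** 2, axis=0).reshape(nz, nx)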
Animal Preparation and Imaging Experiments
The experimental protocol for the animal study was approved by the
Institutional Animal Care and Use Committee at Stanford University. The
university’s animal care and use program and facilities are AAALAC
International accredited, PHS-assured, and USDA licensed. Long Evans and Sprague
Dawley rats (Charles River; n = 16; age 10–14 weeks; weight
260–400 g) were used in this study. We prepared the animals by performing
a bilateral surgical craniotomy and chronic prosthesis implant using previously
published protocols [8]. Briefly, animals
were anesthetized with 3.5% isoflurane in oxygen and anesthesia was maintained
with 1.5% isoflurane. Rats were placed in a stereotaxic frame during surgery for
head fixation and orientation. Body temperature was monitored by a rectal probe
and maintained at 36.5 °C using a warming pad (RightTemp Jr.; Kent
Scientific). A pulse oximeter was used to monitor heart rate and arterial oxygen
saturation (MouseStat Jr.; Kent Scientific). We administered an
anti-inflammatory agent to prevent brain swelling and inflammation (1 mg/kg
dexamethasone intraperitoneally). After a skin incision was performed, parietal
and frontal skull bone fragments (AP +4 to −9 mm; ML ±6 mm) were
cut using a handheld high-speed drill with a 0.7 mm drill bit (Fine Science
Tools). We gently removed the bone flaps, paying special attention to avoid any
damage to the dura mater. We used dental cement (Tetric EvoFlow; Ivoclar
Vivadent) to seal a 125 μm thick polymethylpentene
prosthesis covering the entire craniotomy. The bone was pre-treated with a
bonding agent (iBOND Total Etch; Kulzer). The space between the dura mater and
the polymer prosthesis was filled with 0.9% sterile saline. Animals were then
allowed to recover for 1 week before the first imaging session.

During the imaging sessions, animals were either anesthetized and kept
under anesthesia with 1.5% isoflurane while placed in a stereotaxic frame or
were lightly sedated with 0.5% isoflurane and kept in a restraining apparatus
[38]. The restrained imaging protocol
was also used in the lightly sedated fUS experiment of Fig. 7.
Visual Stimulation Protocol and Functional Activity Maps
To evaluate whether the Deep-fUS approach provides sufficient accuracy
in the reconstruction of time series of power Doppler images in a functional
neuroimaging application, we imaged visual task-evoked brain activation in rats
exposed to binocular green light stimulation. Rats were anesthetized, placed in
a stereotaxic frame, and kept in a dark chamber for at least 30 min prior to the
visual stimulation session for dark adaptation. Six bilateral visual stimuli
were delivered using two green light LEDs driven by a custom circuit. We
controlled the stimulus pattern through a microcontroller board (Arduino Uno)
connected to MATLAB via the serial port and interfaced with the Verasonics
scanner for synchronization with the imaging sequence. For each light stimulus,
the LEDs were flashed for 30 s at a frequency of 3 Hz. Each stimulus was
followed by a >30 s pause in a pseudo-random fashion. We did not average
the resulting CBV traces over multiple trials. This stimulation protocol was
shown to maximize visual cortex response in prior fUS imaging studies [39].

The temporal CBV signals were filtered using a 6th-order
median filter and a 2nd-order Butterworth highpass filter with cutoff
frequency 0.006 Hz to remove the DC offset. Functional activation maps were then
created by calculating the Pearson’s correlation coefficient
r between the temporal power Doppler signal and the
stimulus pattern [1], [9]. We used Fisher’s transformation to
calculate the z scores as

    z = arctanh(r) · √(N − 3) = (√(N − 3) / 2) · ln[(1 + r) / (1 − r)],

with N = 250 the number of temporal samples for
a total time of 416 s. Each pixel was considered significant for
z > 4.74, corresponding to P
< 0.01 in a one-tailed t-test after Bonferroni
correction. We used this threshold to create binary activation maps that only
show the significant pixels.
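A minimal NumPy sketch of this activation-map computation is given below, assuming a power Doppler series of shape (250, 96, 96) and a stimulus vector of length 250, with the median and high-pass filtering steps omitted.

    import numpy as np

    def activation_map(pd_series, stim, z_threshold=4.74):
        """Pearson correlation with the stimulus, Fisher-transformed to z."""
        n_t = pd_series.shape[0]
        # Standardize each pixel time course and the stimulus pattern
        x = (pd_series - pd_series.mean(axis=0)) / pd_series.std(axis=0)
        s = (stim - stim.mean()) / stim.std()
        # Pearson correlation coefficient r in each pixel
        r = np.tensordot(s, x, axes=([0], [0])) / n_t
        # Fisher transformation to z scores
        z = np.arctanh(r) * np.sqrt(n_t - 3)
        # z map and binary map of significant pixels
        return z, z > z_threshold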
Results
The received sensor RF data from 10 plane wave emissions were beamformed in
a regular grid of 96 × 96 pixels with a spatial resolution of 100
μm × 100 μm to create
compound frames. Sequences of compound frames were then processed to compute the
power Doppler images. The conventional processing achieves a satisfactory level of
detail in coronal brain images reconstructed from 250 complex compound frames (Fig. 2b). The resulting SoA images were used for
the CNN training and as a reference for evaluating the reconstruction performance.
To test the power Doppler reconstruction with under-sampled data, we retrospectively
created sparse data sequences by selecting subsamples of k compound
frames from each sequence, with CF of 75% (k = 125), 80%
(k = 100), 85% (k = 75), 90%
(k = 50) and 95% (k = 25). Power Doppler
images reconstructed by the conventional processing with under-sampled data appear
increasingly noisy due to the reduced blood flow sensitivity (Fig. 2c).
Fig. 2.
a, Representative power Doppler image of a coronal slice of
the rat brain reconstructed by Deep-fUS (3D-Res-UNet) from under-sampled
sequences with compression factor (CF) of 75%, 85%, and 95% (Top) and absolute
error images calculated against the state-of-the-art (SoA) image (Bottom).
b, SoA image reconstructed by the conventional processing using
250 complex compound frames. c, Power Doppler images reconstructed
with the conventional processing using under-sampled compound data (Top) and
respective absolute error images (Bottom). Scale bar in a,
b, c: 1 mm.
Deep-fUS Power Doppler Reconstruction
The Deep-fUS networks blindly solve a reconstruction problem to directly
extract the power Doppler values from a sequence of compound frames. The
networks take in input a sparse compound sequence and output the corresponding
power Doppler image (Fig. 1b). The results
of the Bayesian hyperparameter optimization are reported in Table I.

The Deep-fUS reconstruction restored the reference imaging performance
and was able to create maps of the rat brain microvasculature from sparse data
with CF of up to 95% (Figs. 2–4). Our CNNs produced a considerable
improvement in the under-sampled power Doppler reconstruction when compared to
the conventional processing, as confirmed by the quantitative metrics in Table II. While all the trained CNNs
performed significantly better than the conventional processing with sparse
data, the 3D-Res-UNet achieved overall superior reconstruction performance, with
maximum SSIM of 0.92, PSNR of 30.29 dB, and minimum NMSE of 0.04. Introducing
the 3-D convolutional input layer resulted in a maximum SSIM improvement of 0.07
(CF 75%), PSNR improvement of 2.16 dB (CF 80%), and NMSE reduction of 0.08 (CF
95%). The residual connections were responsible for a further maximum SSIM
increase of 0.01, PSNR increase of 0.75 dB, and NMSE reduction of 0.02, all in
the CF 75% case. As further discussed in Sec.
III–B, the 3D-Res-UNet also provided overall better
performance in the computation of functional activation maps, for which we
report the mean absolute error (MAE) in Table
II.
TABLE II
Quantitative Performance Metrics

Network (CF)                 | SSIM            | PSNR (dB)        | NMSE            | Activation Map MAE
3D-Res-UNet (reconstruction)
  CF 75%                     | 0.9166 ± 0.0174 | 30.2851 ± 1.5516 | 0.0399 ± 0.0222 | 0.1019
  CF 80%                     | 0.9041 ± 0.0192 | 29.4798 ± 1.5038 | 0.0448 ± 0.0278 | 0.1038
  CF 85%                     | 0.8915 ± 0.0211 | 28.8217 ± 1.4282 | 0.0506 ± 0.0224 | 0.1193
  CF 90%                     | 0.8619 ± 0.0264 | 27.8481 ± 1.4058 | 0.0701 ± 0.0472 | 0.1274
  CF 95%                     | 0.8154 ± 0.0353 | 26.7270 ± 1.2469 | 0.1106 ± 0.0364 | 0.1724
3D-UNet (reconstruction)
  CF 75%                     | 0.9046 ± 0.0189 | 29.5322 ± 1.4163 | 0.0647 ± 0.0259 | 0.1161
  CF 80%                     | 0.8981 ± 0.0201 | 29.4254 ± 1.2746 | 0.0652 ± 0.0335 | 0.1278
  CF 85%                     | 0.8874 ± 0.0202 | 29.0893 ± 1.2615 | 0.0728 ± 0.0308 | 0.1264
  CF 90%                     | 0.8600 ± 0.0238 | 28.2881 ± 1.2772 | 0.0913 ± 0.0464 | 0.1597
  CF 95%                     | 0.8139 ± 0.0373 | 27.1319 ± 0.7524 | 0.1177 ± 0.0495 | 0.1858
UNet (reconstruction)
  CF 75%                     | 0.8394 ± 0.0342 | 27.6412 ± 0.7867 | 0.1222 ± 0.0488 | 0.1362
  CF 80%                     | 0.8348 ± 0.0323 | 27.2672 ± 0.7142 | 0.1375 ± 0.0456 | 0.1517
  CF 85%                     | 0.8384 ± 0.0325 | 27.3371 ± 0.8499 | 0.1399 ± 0.0385 | 0.1355
  CF 90%                     | 0.8237 ± 0.0337 | 26.7846 ± 0.8862 | 0.1398 ± 0.0467 | 0.1376
  CF 95%                     | 0.7927 ± 0.0444 | 26.1352 ± 0.8169 | 0.1946 ± 0.0550 | 0.1674
PP-UNet (post-processing)
  CF 75%                     | 0.9269 ± 0.0153 | 31.1037 ± 1.2606 | 0.0359 ± 0.0239 | 0.1020
  CF 80%                     | 0.9165 ± 0.0179 | 30.6757 ± 1.4619 | 0.0396 ± 0.0224 | 0.1100
  CF 85%                     | 0.9020 ± 0.0212 | 30.2345 ± 1.1474 | 0.0459 ± 0.0242 | 0.1245
  CF 90%                     | 0.8760 ± 0.0233 | 29.1286 ± 1.3275 | 0.0610 ± 0.0323 | 0.1499
  CF 95%                     | 0.8226 ± 0.0376 | 27.3628 ± 0.9142 | 0.1020 ± 0.0537 | 0.1691
Conventional
  CF 75%                     | 0.6955 ± 0.0362 | 21.3873 ± 0.4584 | 0.2726 ± 0.0208 | 0.1602
  CF 80%                     | 0.6799 ± 0.0387 | 21.0242 ± 0.4763 | 0.2793 ± 0.0202 | 0.1808
  CF 85%                     | 0.6553 ± 0.0438 | 20.4423 ± 0.5235 | 0.2938 ± 0.0303 | 0.2061
  CF 90%                     | 0.6142 ± 0.0497 | 19.5135 ± 0.6068 | 0.3193 ± 0.0349 | 0.2440
  CF 95%                     | 0.5181 ± 0.0585 | 17.4375 ± 0.6265 | 0.3947 ± 0.0652 | 0.2942
All the metrics for the different models were calculated versus the
respective reference images. CF: compression factor; SSIM: structural
similarity index metric; PSNR: peak signal-to-noise ratio; NMSE: normalized
mean squared error; MAE: mean absolute error.
Fig. 2 displays a representative
coronal slice of the rat brain reconstructed by the 3D-Res-UNet and by the
conventional approach, and scatter plots in Fig.
3a highlight that reconstruction errors are more prominent at
the lower power Doppler values, particularly in the case of
conventional processing. To further confirm the generalization capabilities of
our approach, we used the network trained on coronal brain images to reconstruct
a sagittal slice, as displayed in Fig. 4.
The mean prediction time for the 3D-Res-UNet calculated on the test set was
between 4.4 and 13.5 ms/image (Table
III). Movies showing side-by-side comparisons of the conventional and
Deep-fUS reconstruction with CF 75%, 85%, and 95% are available in Videos S1–S3.
Fig. 3.
a, Scatter plots of the power Doppler pixel amplitudes and
linear regression analysis (y = b₁x + b₂).
b-d, Structural similarity index metric (SSIM), normalized mean
squared error (NMSE), and peak signal-to-noise ratio (PSNR) of power Doppler
images reconstructed by Deep-fUS (3D-Res-UNet; blue) and by the conventional
approach (red). The quantitative metrics were calculated against the respective
SoA reference images. Results are reported as mean (solid line) and standard
deviation (shaded area) calculated over the test set.
Fig. 4.
a, Representative power Doppler image of a sagittal slice
of the rat brain reconstructed by Deep-fUS (3D-Res-UNet) from under-sampled
sequences with compression factor (CF) 75%, 85%, and 95% (Top) and absolute
error images calculated against the state-of-the-art (SoA) image (Bottom).
b, SoA image reconstructed by the conventional processing.
c, Power Doppler images reconstructed with the conventional
processing using under-sampled compound data (Top) and respective absolute error
images (Bottom). Scale bar in a, b, c: 1
mm.
TABLE III
Deep-fUS Network Parameters and Processing Times

                               | 3D-Res-UNet | 3D-UNet   | UNet      | PP-UNet      | Conventional
N. layers                      | 5 + 1       | 5 + 1     | 5         | 5            | -
N. trainable parameters        | 9,788,421   | 8,773,701 | 8,665,633 | 8,629,921    | -
Training epochs                | 1500        | 1500      | 2500      | 500          | -
Reconstruction time (ms/image)
  CF 75%                       | 13.5        | 7.12      | 5.28      | 2.06 + 255.7 | 255.7
  CF 80%                       | 11.1        | 5.67      | 4.6       | 2.04 + 220.3 | 220.3
  CF 85%                       | 9.2         | 4.77      | 3.88      | 2.09 + 207.2 | 207.2
  CF 90%                       | 6.7         | 3.89      | 3.34      | 2.08 + 194.8 | 194.8
  CF 95%                       | 4.4         | 2.95      | 2.66      | 2.04 + 183.4 | 183.4
The PP-UNet provides imaging performance comparable to the 3D-Res-UNet
(Table II) but adds a run-time
overhead of 212 ms/image (averaged over all the CFs) for the conventional
processing of the sparse power Doppler images, which are then provided as input
to the neural network for post processing (Table
III). It is also worth noting that this approach, like the
conventional method, is inherently dependent on the design of the tissue clutter
filter, and a wide range of parameters have been used in the fUS literature for
the temporal high-pass and SVD filters. Therefore, a data-driven method that
does not require prior model-based knowledge would be highly advantageous and
potentially simplify fUS implementations. Interestingly, we noted that the
learned convolutional filters in the input layer of the 3D-Res-UNet implement
highpass transfer functions with strong rejection of the 0-Hz component (Fig. 1d). These filters appear to mimic the
filters used in the conventional processing but are learned directly from the
data during training.
Task-Evoked Functional Activity Imaging
Power Doppler images of the brain vasculature only provide a screenshot
of the CBV signal at a given time. In fUS neuroimaging applications, the
functional information is extracted from the temporal evolution of the CBV
signals in the form of temporal correlations with a stimulus (i.e., functional
activation) or between brain regions of interest (i.e., functional
connectivity). We thus sought to test the performance of the Deep-fUS approach
in a visual-evoked functional activation application.

The SoA activation map is shown for reference in Fig. 5b. This map was
created from power Doppler images conventionally reconstructed with the
full compound dataset and shows
significant bilateral activation of the primary and secondary visual cortices
(V1/2) and superior colliculus (SC). In Fig.
5c we show the activation maps created using power Doppler time
series reconstructed by Deep-fUS (3D-Res-UNet) using sparse data with CF between
75% and 95%. Although the quality of the activation maps degraded with
increasing data sparsity, significant V1/2 and SC activation could be detected
with a CF up to 95%. The network generalized well to the reconstruction of time
series of power Doppler images and was able to accurately reproduce the small
changes in relative CBV signal (~10%; Fig.
5e–f) characteristic of
visual-evoked cortical activation. It is worth pointing out that the network did
not formally learn to perform the reconstruction of such image sequences, as it
was trained on single images, and had no direct knowledge of the functional
information content of the sequence.
Fig. 5.
a, Time series of power Doppler images were recorded
continuously during a visual stimulation task. The resulting cerebral blood
volume (CBV) signals were correlated with the stimulus pattern. The visual
stimulus consisted of 6 light stimuli, each with an ON time of 30 s, distributed
in a pseudo-random fashion. b, State-of-the-art (SoA) activation
map computed using power Doppler images reconstructed by the conventional
approach using 250 complex compound frames. Statistically significant pixels (P
< 0.01; Bonferroni corrected) are shown in the heat map. The white
contour displays the slice at bregma −7.0 mm from the Paxinos brain
atlas. The activation map shows significant bilateral activation of the rat
primary and secondary visual cortices (V1/2) and superior colliculus (SC).
c, Activation maps computed using power Doppler images
reconstructed by Deep-fUS with compression factor (CF) between 75% and 95%.
d, Activation maps computed using power Doppler images
reconstructed by conventional processing with CF between 75% and 95%.
e, Relative CBV signals in the statistically significant pixels
of the SoA map in V1/2 for the SoA data (magenta), and Deep-fUS (blue) and
conventional processing (red) with CF of 95%. Results are reported as mean
(solid line) and standard deviation (shaded area). The stimulus pattern is
displayed in gray. f, Relative CBV signals in the statistically
significant pixels of the SoA map in SC for the SoA data (magenta), and Deep-fUS
(blue) and conventional processing (red) with CF of 95%.
Notably, Deep-fUS with CF 95% performed better than the conventional
approach with CF 75%. With shorter data sequences (i.e., higher CF), the
conventional processing provides increasingly noisy CBV temporal signals that
result in lower and non-significant correlations with the stimulus (Fig. 5e–f). This is also confirmed by the quantitative error metric (MAE) in
Table II. Introducing the 3-D
convolutional input layer and residual connections reduced the activation
map MAE in all cases except for CF ≥ 90%, as compared to the simple
U-Net. This may be due to the short temporal signals, which make it more
challenging to train the 3-D filters.

To investigate an alternative approach to data compression, we created
temporally and spatially under-sampled sequences by retaining only a subset of
compound samples in each frame, with a spatial under-sampling ratio
m = 1/2 and m = 1/4. The spatial samples
were selected as shown in the bottom plots of Fig.
6. We used k = 50 and k = 100 in
the two cases to equalize the CF to 95%. This approach improved the quality of
the functional activation maps compared to the case with temporal under-sampling
only (Fig. 6), and the resulting activation
map MAE was 0.1406 for m = 1/2 and 0.1279 for
m = 1/4. These results suggest that spatial sparsity may be
a viable option to further increase data compression while retaining the
advantages of longer acquisitions. On the other hand, this approach does not
benefit from the decreased sensitivity to motion artifacts discussed in the next
section.
Fig. 6.
Representative power Doppler test images (top) and activation maps
(middle) computed with spatially under-sampled sequences with spatial sampling
ratio m = 1/2 (a) and m = 1/4 (b). To equalize the
compression factor (CF) to 95%, k = 50 and k = 100 compound frames were used in
the two cases. The spatial sampling maps are displayed in the bottom plots. The
black and white pixels show discarded and retained pixels, respectively. The
maps are shown in a sub-grid of 12 × 12 pixels.
A movie of the Deep-fUS power Doppler series and relative CBV variation
from the visual-evoked experiment with temporally and spatially sparse data is
provided in Video S4.
Motion Artifact Reduction
To determine whether shorter acquisition sequences reduce the occurrence
of motion artifacts, we used Deep-fUS to reconstruct a time series of power
Doppler images acquired in a lightly sedated and restrained animal. We computed
the SSIM of each image in the series versus a baseline calculated as the median
of all images in the acquisition, and then applied an SSIM threshold to filter out
the images that showed significant degradation, possibly due to animal motion
(Fig. 7). In the case of SoA processing
using the full compound sequence, 8.2% of power Doppler images were discarded by
the filter (Fig. 7a). Image scrubbing was
reduced to between 4.5% (with CF 75%) and 2.1% (with CF 95%), giving a maximum
scrubbing reduction of 74%. Fig. 7c
displays a representative SoA power Doppler image that was discarded by the SSIM
filter. Motion artifacts were resolved in the same image processed by Deep-fUS
(Fig. 7d). We provide a side-by-side
comparison of the SoA and Deep-fUS image scrubbing in the lightly sedated fUS
experiment in Video S5.
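A minimal sketch of this SSIM-based scrubbing filter follows, assuming the threshold is defined as the mean baseline SSIM minus 3 standard deviations (consistent with Fig. 7b) and using scikit-image for the SSIM computation.

    import numpy as np
    from skimage.metrics import structural_similarity

    def scrub(series):
        """series: power Doppler images of shape (n_images, nz, nx)."""
        # Baseline image: median of all images in the acquisition
        baseline = np.median(series, axis=0)
        ssim = np.array([
            structural_similarity(img, baseline,
                                  data_range=baseline.max() - baseline.min())
            for img in series
        ])
        # Discard images more than 3 standard deviations below the mean SSIM
        threshold = ssim.mean() - 3.0 * ssim.std()
        return ssim >= threshold    # True for retained images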
Fig. 7.
a, A series of 1000 power Doppler images was filtered based
on a structural similarity index metric (SSIM) filter. Black dots display the
discarded images in the series. b, We defined a threshold (red) to
remove the images with an SSIM value lower than 3 standard deviations from the
baseline. c, d, Representative power Doppler coronal
images in a case of significant degradation in the conventional reconstruction
(c) that was completely resolved with under-sampled processing
(d). Scale bar: 1 mm.
Discussion
Deep learning and CNNs are drawing increasing attention for the
reconstruction and processing of biomedical images with sparse data [40]–[42].
In medical ultrasound, several strategies have been proposed to restore high image
quality while reducing data sampling, transmission, and processing [28], [29], [31], [43]. With the exception of a single preliminary study reporting deep
learning of color Doppler images [44],
however, CNNs have not been applied as extensively to ultrasound imaging of blood
flows. We proposed a deep learning platform for the direct reconstruction of power
Doppler images from a 3-D space of sparse compound ultrasound data (Fig. 1). Our proposed approach largely enhanced imaging
performance compared to the conventional method with compression factors up to 95%,
as clearly indicated by the presented quantitative metrics (Fig. 3). We demonstrate that our method is able to
reconstruct time series of power Doppler images with sufficient accuracy to compute
functional activation maps in a task-evoked neuroimaging application (Fig. 5 and 6).
Although it was not formally trained for such a reconstruction task, the network
generalized well and accurately reproduced changes in relative CBV signals on the
order of 10%. Additionally, we show that by minimizing the length of the acquisition
sequence, our network allows greater robustness to motion artifacts in an experiment
with a lightly sedated animal and is less sensitive to image scrubbing (Fig. 7).

The first advantage of using sparse sequences is the net reduction in data
acquisition, storage, and processing resources, which may effectively simplify fUS
development. To provide an indication of a representative scenario, in our
implementation with 250 compound images each made of 10 plane waves, we acquire and
beamform 2500 full B-mode-equivalent frames to compute a single power Doppler image.
The acquisition lasts 250 ms, and RF data from each acquisition are sampled and
stored on the scanner. The first bottleneck is the direct memory access (DMA) for
data transfer to the host machine via PCIe bus. Data are then beamformed at runtime
in a GPU to reduce storage requirements (see calculation in the Appendix) and saved on a solid-state drive. Processing (on
the host machine) and data acquisition (on the scanner) happen in parallel. The
power Doppler computation is performed offline, as SVD filters are also
computationally demanding. Our resulting frame rate excluding power Doppler
processing is 0.6 Hz. However, it is important to note that our field of view is
limited, as we are imaging the rat brain in coronal view (~10 mm × 10
mm), but these considerations become yet more significant in situations where larger
regions are imaged, as in larger animals or humans.

The suggested approach may facilitate the development of fUS neuroimaging in
any setting where dedicated hardware is not available or in clinical scanners,
making this technology more affordable and opening the way to new potential
applications based on this imaging modality. Additionally, sparse sequences may
prove beneficial in experimental situations where fUS acquisitions need to be
interleaved with long therapeutic ultrasound pulses, such as in the monitoring of
focused ultrasound neurointerventions [45],
[46]. Although in this study we
retrospectively under-sampled the compound data, we clearly demonstrate that the
network may considerably reduce the beamforming complexity and eliminate the need
for computationally demanding filters [15],
[16]. Additionally, the network has the
potential to increase the imaging frame rate and to facilitate the implementation of
volumetric fUS imaging using swept linear arrays [6]. The platform and conceptual framework that we propose may be adapted
to other high-frame-rate Doppler ultrasound imaging modalities, including vector
flow imaging, to expedite their deployment in portable ultrasound systems [47], [48].

To create sparse sequences, we selected subsets of compound images in the
initial portion of the original sequence. This approach has the advantage of
yielding shorter temporal acquisition windows and reduces the occurrence of motion
artifacts and data scrubbing, which are inevitable in mobile rodent and handheld fUS
imaging applications [5], [9], [49].
Alternative sampling approaches, including interleaved or randomly distributed
frames, may be integrated with our proposed platform by training and optimizing the
networks with the desired sampling method. We expect that longer observation times
will facilitate the power Doppler reconstruction, as displayed in Fig. 6, but will make the sequence more sensitive to
motion artifacts, as discussed in Sec.
III–C. Adaptive sampling based on a motion measure could be an
exciting future development for this power Doppler reconstruction approach.

We acknowledge that variants of the U-Net have been previously applied to
different biomedical imaging modalities. However, most of the literature is focused
on removing artifacts from sub-optimally reconstructed images. We were specifically
interested in demonstrating a data-driven reconstruction method that, once trained,
requires no prior model-based knowledge of the image formation process nor requires
hand-picked parameters. We decided to base our implementation on the U-Net as we
hypothesized that its encoder-decoder architecture would fit the nature of our data.
A critical step in the power Doppler reconstruction process is the filtration of the
strong clutter signal originating from the moving tissue. In the 3-D space formed by
the image plane and Doppler time, the clutter signal is slowly varying in the
temporal domain and highly correlated in the spatial domain, therefore it is crucial
to account for both spatially and temporally varying features in the reconstruction
process. By progressively expanding the spatial field of view in the encoder layers
and with the input filters performing temporal convolutions, our network extracts
spatiotemporal features from severely under-sampled input datasets. By using more
sophisticated networks, interesting applications may be developed in the future
based on the current work. Unsupervised algorithms may be designed to train variants
of generative adversarial networks (for example, CycleGAN [50]) on 2-D power Doppler images for the reconstruction
of volumetric fUS data acquired with sparse physical apertures. Considering the
cost, complexity, and bulkiness of 3-D ultrasound systems, such advances may greatly
facilitate 4-D fUS imaging applications.

A main limitation of using ultrasound for brain imaging is the presence of
the skull, which is highly absorbing at the imaging frequencies. This has limited
clinical fUS to intraoperative applications or to scenarios with natural skull
openings, such as the neonatal anterior fontanel window. These limitations may be
partly overcome by using contrast agents to enhance the ultrasound signal [51]–[53].

Finally, further validation of our reconstruction approach is recommended,
as changes in the functional activation maps at higher compression factors may
distort the interpretation of clinical or scientific imaging data.