Branimir Rusanov, Ghulam Mubashar Hassan, Mark Reynolds, Mahsheed Sabet, Jake Kendrick, Pejman Rowshanfarzad, Martin Ebert.
Abstract
The use of deep learning (DL) to improve cone-beam CT (CBCT) image quality has gained popularity as computational resources and algorithmic sophistication have advanced in tandem. CBCT imaging has the potential to facilitate online adaptive radiation therapy (ART) by utilizing up-to-date patient anatomy to modify treatment parameters before irradiation. Poor CBCT image quality has been an impediment to realizing ART due to the increased scatter conditions inherent to cone-beam acquisitions. Given the recent interest in DL applications in radiation oncology, and specifically DL for CBCT correction, we provide a systematic theoretical and literature review for future stakeholders. The review encompasses DL approaches for synthetic CT generation, as well as projection domain methods employed in the CBCT correction literature. We review trends pertaining to publications from January 2018 to April 2022 and condense their major findings, with emphasis on study design and DL techniques. Clinically relevant endpoints relating to image quality and dosimetric accuracy are summarized, highlighting gaps in the literature. Finally, we make recommendations for both clinicians and DL practitioners based on literature trends and the current DL state-of-the-art methods utilized in radiation oncology.
Keywords: AI; CT; adaptive radiotherapy; cone-beam CT; deep learning; image synthesis; synthetic CT
Year: 2022 PMID: 35789489 PMCID: PMC9543319 DOI: 10.1002/mp.15840
Source DB: PubMed Journal: Med Phys ISSN: 0094-2405 Impact factor: 4.506
Benefits and limitations of three common deep learning (DL) architectures: U‐Net, GAN (generative adversarial network), and cycle‐GAN
| Architecture | Strengths | Limitations |
|---|---|---|
| U‐Net | Simplest implementation; stable convergence; fastest training | Paired data only; anatomic misalignments reduce model accuracy and image realism |
| GAN | Paired or unpaired training; improved image realism due to adversarial loss; model tunability | Moderate implementation difficulty; unstable convergence; slower training; poor structure preservation for unpaired data |
| Cycle‐GAN | Paired or unpaired training; model tunability; improved image realism due to adversarial loss; good structure preservation | Complex implementation; unstable convergence; slowest training; highest hardware requirements |
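The cycle-consistency term that distinguishes the cycle-GAN from a plain GAN can be sketched in a few lines of numpy. Here `g_ct` and `g_cbct` are hypothetical stand-ins for the two trained generators (CBCT→sCT and CT→sCBCT); this is a minimal illustration of the loss definition, not the implementation of any reviewed study.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two images."""
    return float(np.mean(np.abs(a - b)))

def cycle_loss(g_ct, g_cbct, cbct, ct):
    """Cycle-consistency: translating CBCT -> sCT -> cycled CBCT (and
    CT -> sCBCT -> cycled CT) should reproduce the input image, which is
    what enforces structure preservation for unpaired training."""
    forward = l1(g_cbct(g_ct(cbct)), cbct)   # CBCT -> sCT -> cycled CBCT
    backward = l1(g_ct(g_cbct(ct)), ct)      # CT -> sCBCT -> cycled CT
    return forward + backward

# Toy "generators": identity mappings give zero cycle loss.
identity = lambda x: x
img = np.random.rand(8, 8)
assert cycle_loss(identity, identity, img, img) == 0.0
```

In practice this term is weighted and summed with the adversarial (and optionally identity, gradient, or synthetic) losses listed in the tables below.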
FIGURE 1. (a) The convolution output (feature map) results from element‐wise multiplication followed by summation between the filter and image. Note how image information is encoded into a reduced spatial dimension. (b) Depiction of the U‐Net architecture. Note how the input spatial dimensions are progressively reduced, whereas the feature dimension increases with network depth. (c) The GAN architecture comprising a generator and discriminator. Generators are typically U‐Net‐type architectures with encoder/decoder arms, whereas discriminators are encoder classifiers. (d) The Cycle‐GAN network comprising two generators and discriminators capable of unpaired image translation via computation of the cycle error. The orange arrows indicate the backward synthesis cycle path.
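The feature-map computation described in Figure 1(a) can be reproduced directly: slide a filter over the image, multiply element-wise, and sum. A minimal numpy sketch (cross-correlation with "valid" padding, as CNN layers conventionally compute it):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as used in CNN layers):
    element-wise multiply the kernel with each image patch, then sum.
    The output feature map has reduced spatial dimensions."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)
edge = np.array([[1.0, -1.0]])        # simple horizontal gradient filter
fmap = conv2d(image, edge)
assert fmap.shape == (4, 3)           # 4x4 input -> 4x3 feature map
```

Stacking many such filters, interleaved with downsampling, is what produces the encoder arm of the U‐Net in Figure 1(b).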
Summary of common image and dose based similarity metrics
| Metric | | Formula |
|---|---|---|
| Image similarity | MAE/ME ↓ | $\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\lvert \mathrm{sCT}_i-\mathrm{CT}_i\rvert$; $\mathrm{ME}=\frac{1}{N}\sum_{i=1}^{N}(\mathrm{sCT}_i-\mathrm{CT}_i)$ |
| | MSE/RMSE ↓ | $\mathrm{MSE}=\frac{1}{N}\sum_{i=1}^{N}(\mathrm{sCT}_i-\mathrm{CT}_i)^2$; $\mathrm{RMSE}=\sqrt{\mathrm{MSE}}$ |
| | PSNR ↑ | $\mathrm{PSNR}=10\log_{10}\left(\mathrm{MAX}^2/\mathrm{MSE}\right)$, with MAX the maximum image intensity |
| | SSIM ↑ | $\mathrm{SSIM}=\frac{(2\mu_x\mu_y+c_1)(2\sigma_{xy}+c_2)}{(\mu_x^2+\mu_y^2+c_1)(\sigma_x^2+\sigma_y^2+c_2)}$, with $\mu$, $\sigma^2$, $\sigma_{xy}$ the local means, variances, and covariance |
| | DICE ↑ | $\mathrm{DICE}=\frac{2\lvert X\cap Y\rvert}{\lvert X\rvert+\lvert Y\rvert}$, with $X$, $Y$ binary segmentation masks |
| Dosimetric similarity | DPR ↑ | Fraction of voxels where the dose difference (DD) is within tolerance |
| | DVH ↑ | Cumulative histogram of dose–volume frequency distribution for a given volume |
| | GPR ↑ | Fraction of voxels where $\gamma \leqslant 1$, combining dose‐difference and distance‐to‐agreement criteria |
Note: Arrows indicate the direction of a better result.
Abbreviations: DPR, dose difference pass rate; DVH, dose–volume histogram; GPR, gamma pass rate; MAE, mean absolute error; ME, mean error; MSE, mean squared error; PSNR, peak signal‐to‐noise ratio; RMSE, root mean square error; SSIM, structural similarity.
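The image-similarity metrics in the table above are straightforward to implement. The sketch below uses numpy; the SSIM here is a single-window (global) variant for brevity, whereas the standard metric averages the same formula over small local windows.

```python
import numpy as np

def mae(sct, ct):   # mean absolute error, in HU
    return float(np.mean(np.abs(sct - ct)))

def me(sct, ct):    # mean (signed) error
    return float(np.mean(sct - ct))

def rmse(sct, ct):  # root mean square error
    return float(np.sqrt(np.mean((sct - ct) ** 2)))

def psnr(sct, ct, data_range):
    """Peak signal-to-noise ratio (dB); higher is better."""
    return float(10.0 * np.log10(data_range ** 2 / np.mean((sct - ct) ** 2)))

def ssim_global(x, y, data_range):
    """Single-window SSIM; the standard metric averages this formula
    over local windows, but the terms are identical."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def dice(mask_x, mask_y):
    """Overlap of two binary segmentation masks; 1 = perfect agreement."""
    inter = np.logical_and(mask_x, mask_y).sum()
    return float(2.0 * inter / (mask_x.sum() + mask_y.sum()))

ct = np.array([0.0, 100.0, 1000.0])
sct = np.array([10.0, 90.0, 1000.0])
assert me(sct, ct) == 0.0                     # signed errors cancel
assert abs(mae(sct, ct) - 20.0 / 3.0) < 1e-9  # absolute errors do not
assert abs(ssim_global(ct, ct, 1000.0) - 1.0) < 1e-9
```

Note the ME example: signed errors can cancel, which is why MAE and ME are usually reported together.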
FIGURE 2. Flowchart of the study selection process
Summary of synthetic CT (sCT) generation methods
| Author and year | Anatomic site | Model | Loss function | Augmentation | Preprocessing | (train/val/test) | Training configuration | Image similarity (input CBCT) | Dose similarity |
|---|---|---|---|---|---|---|---|---|---|
| Kida et al. 2018 | Pelvis | U‐Net | MAE | | Voxels outside body set to −1000 HU; intra‐subject RR and DIR; masked CT to CBCT contour | 5‐CV (16/0/4) | Paired axial 2D | SSIM: 0.967 (0.928). PSNR: 50.9 (31.1). RMSE: 13 (232) | |
| Xie et al. 2018 | Pelvis | Deep‐CNN | MSE | | Intra‐subject DIR; 2D patches; DIR patches | 15/0/5 | Paired axial patch 2D | PSNR: 8.823 (7.889). Anatomy ROI mean HU | |
| Chen et al. 2019 | HN | U‐Net | MAE; SSIM | | Resample CT to CBCT; rescaled HU [0, 1]; intra‐subject RR plan CT and CBCT; intra‐subject DIR replan CT and CBCT | 30/7/7 | Dual‐input paired axial 2D | MAE: 18.98 (44.38). PSNR: 33.26 (27.35). SSIM: 0.8911 (0.7109). RMSE: 60.16 (126.43) | |
| | Pelvis | | | | | 6/0/7 | Paired axial 2D | MAE: 42.40 (104.21). PSNR: 32.83 (27.59). SSIM: 0.9405 (0.8897). RMSE: 94.06 (163.71) | |
| Harms et al. 2019 | Brain | Cycle‐GAN | Adversarial loss; cycle loss (L1.5 norm); synthetic loss (L1.5 norm); gradient loss | | Resample CBCT to CT; intra‐subject RR; inter‐subject RR to common volume; air truncation; 3D patches | LoO‐CV (23/0/1) | Paired patch 3D | MAE: 13.0 ± 2.2 (23.8 ± 5.1). PSNR: 37.5 ± 2.3 (32.3 ± 5.9) | |
| | Pelvis | | | | | LoO‐CV (19/0/1) | | MAE: 16.1 ± 4.5 (56.3 ± 19.7). PSNR: 30.7 ± 3.7 (22.2 ± 3.4) | |
| Kida et al. 2019 | Pelvis | Cycle‐GAN | Adversarial loss; cycle loss; total variation loss; air loss; gradient loss; idempotent loss | Gaussian noise | Voxels outside body set to −1000 HU; intra‐subject RR plan CT and CBCT; air truncation; intra‐subject DIR replan CT and CBCT; HU clipped [−500, 200]; HU rescaled [−1, 1] | 16/0/4 | Unpaired axial 2D | Average ROI HU. Volume HU histograms. Self‐SSIM | |
| Kurz et al. 2019 | Pelvis | Cycle‐GAN | Adversarial loss; cycle loss | Random cropping; random left‐right flips | Intra‐subject RR; voxels outside body set to −1000 HU; CT/CBCT downsampled; HU clipped [−1000, 2071], rescaled 16 bit | 4‐CV (25/0/8) | Unpaired axial 2D | MAE: 87 (103) | DD1: 89%. DD2: 100%. DVH < 1.5% |
| Lei et al. 2019 | Brain | Cycle‐GAN | Adversarial loss; cycle loss; synthetic loss | | Intra‐subject RR | LoO‐CV (11/0/1) | Paired patch 3D | MAE: 20.8 ± 3.4 (44.0 ± 12.6). PSNR: 32.8 ± 1.5 (26.1 ± 2.5) | |
| Li et al. 2019 | Nasopharynx | U‐Net | MAE | Random left‐right flips; random positional shifts | Resample CT to CBCT; intra‐patient RR | 50/10/10 | Paired axial 2D | MAE: 6–27 (60–120). ME: −26–4 (−74–51) | DVH < 0.2% (0.8%). GPR1: 95.5% (90.8%) |
| Liang et al. 2019 | HN | Cycle‐GAN | Adversarial loss; cycle loss; identity loss | | Resample CT to CBCT; HU rescaled [−1, 1]; intra‐patient DIR on test data | 81/9/20 | Unpaired axial 2D | MAE: 29.85 ± 4.94 (69.29 ± 11.01). RMSE: 84.46 ± 12.40 (182.8 ± 29.16). SSIM: 0.85 ± 0.03 (0.73 ± 0.04). PSNR: 30.65 ± 1.36 (25.28 ± 2.19) | GPR2: 98.40% ± 1.68% (91.37% ± 6.72%). GPR1: 96.26% ± 3.59% (88.22% ± 88.22%) |
| Barateau et al. 2020 | HN | GAN | Perceptual loss; adversarial loss | Random translations, rotations, shears | Intra‐patient RR and DIR | 30/0/14 | Paired axial 2D | MAE: 82.4 ± 10.6 (266.6 ± 25.8) | GPR2: 98.1% (91.0%). DVH (OAR) < 99 cGy. DVH (PTV) < 0.7% |
| Eckl et al. 2020 | HN | Cycle‐GAN | Adversarial loss; cycle loss; synthetic loss | | Thorax and HN HU clipped [−1000, 4000]; pelvis HU clipped [−1000, 1000]; HU rescaled [−1, 1]; intra‐patient RR; images resampled 224 × 224 | 25/0/15 | Paired axial 2D | MAE: 77.2 ± 12.6 | GPR3: 98.6 ± 1.0%. GPR2: 95.0 ± 2.4%. DD2: 91.5 ± 4.3%. DVH < 1.7% |
| | Thorax | | | | | 53/0/15 | | MAE: 94.2 ± 31.7 | GPR3: 97.8 ± 3.3%. GPR2: 93.8 ± 5.9%. DD2: 76.7 ± 17.3%. DVH < 1.7% |
| | Pelvis | | | | | 205/0/15 | | MAE: 41.8 ± 5.3 | GPR3: 99.9 ± 0.1%. GPR2: 98.5 ± 1.7%. DD2: 88.9 ± 9.3%. DVH < 1.1% |
Note: Dose similarity values in italics indicate proton plans; all others are photon plans. GPR3 = 3%/3 mm; GPR2 = 2%/2 mm.
Abbreviations: CBCT, cone‐beam CT; CNN, convolutional neural network; CV, cross‐validation; DIR, deformable image registration; DVH, dose–volume histogram; GAN, generative adversarial network; HN, head and neck; LoO‐CV, leave‐one‐out cross‐validation; MAE, mean absolute error; ME, mean error; MSE, mean squared error; P/C/GTV, planning/clinical/gross target volume; PSNR, peak signal‐to‐noise ratio; RMSE, root mean square error; ROI, region of interest; RR, rigid registration; SSIM, structural similarity.
Image similarity metrics computed within body contour.
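A recurring preprocessing recipe across the studies tabulated above combines body masking, HU clipping, and intensity rescaling. The sketch below is a generic illustration of those three steps; the clip window and target range vary by study and anatomy, so the defaults here are placeholders, not any single publication's pipeline.

```python
import numpy as np

def preprocess_hu(volume, body_mask, lo=-1000.0, hi=2000.0):
    """Common CBCT/CT preprocessing: air-fill outside the body contour,
    clip HU to a study-specific window, then rescale to [-1, 1]."""
    vol = np.where(body_mask, volume, -1000.0)   # voxels outside body -> air
    vol = np.clip(vol, lo, hi)                   # limit dynamic range
    return 2.0 * (vol - lo) / (hi - lo) - 1.0    # linear rescale to [-1, 1]

vol = np.array([[-2000.0, 0.0], [500.0, 3000.0]])
mask = np.array([[False, True], [True, True]])
out = preprocess_hu(vol, mask)
assert out.min() >= -1.0 and out.max() <= 1.0
assert out[0, 0] == -1.0   # outside-body voxel maps to the air value
```

The inverse mapping is applied to network outputs before dose calculation, so the clip window also bounds the HU range the model can synthesize.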
Summary of synthetic CT (sCT) generation methods
| Author and year | Anatomic site | Model | Loss function | Augmentation | Preprocessing | (train/val/test) | Training configuration | Image similarity (input CBCT) | Dose similarity |
|---|---|---|---|---|---|---|---|---|---|
| Liu et al. 2020 | Abdomen | Cycle‐GAN | Adversarial loss; cycle loss; synthetic loss | | Intra‐patient RR and DIR; CBCT resampled to CT | LoO‐CV (29/0/1) | Paired patch 3D | MAE: 56.89 ± 13.84 (81.06 ± 15.86) | DVH < 0.8% |
| Maspero et al. 2020 | HN; Lung; Breast | Cycle‐GAN | Adversarial loss; cycle loss | Random left‐right flipping; random 30 × 30 cropping | Voxels outside largest circular mask on CBCT and CT set to −1000 HU; intra‐patient RR; images resampled 286 × 286; HU clipped [−1024, 3071]; HU rescaled [0, 1]; CT anatomy outside CBCT FOV stitched on | 15/8/10 per site | Unpaired axial 2D | HN: MAE 51 ± 12 (195 ± 20); Lung: MAE 86 ± 9 (219 ± 44); Breast: MAE 67 ± 18 (152 ± 40) | HN: GPR3 99.3 ± 0.4%, GPR2 97.8 ± 1%; Lung: GPR3 98.2 ± 1%, GPR2 94.9 ± 3%; Breast: GPR3 97 ± 4%, GPR2 92 ± 8% |
| Park et al. 2020 | Lung | Cycle‐GAN | Adversarial loss; cycle loss | | CT and CBCT resampled to 384 × 384 | 8/0/2 | Unpaired sagittal and coronal 2D | PSNR: 30.60 (26.13). SSIM: 0.8977 (0.8173) | |
| Thummerer et al. 2020 | HN | U‐Net | MAE | | Voxels outside body set to −1000 HU; intra‐patient RR and DIR; CT and CBCT masks reduced to common voxels; slices containing shoulders removed | 3‐CV (16/2/9) | Paired axial, sagittal and coronal 2D | MAE: 40.2 ± 3.9 | |
| Thummerer et al. 2020 | HN | U‐Net | MAE | Small translations; random left‐right mirroring | Voxels outside body set to −1000 HU; intra‐patient RR and DIR; CT and CBCT masks reduced to common voxels; slices containing shoulders removed | 3‐CV (11/11/11) | Paired axial, sagittal, coronal 2D | MAE: 36.3 ± 6.2 | |
| Xie et al. 2020 | Pelvis | Deep‐CNN | Contextual loss | Random rotations | Intra‐patient DIR | 499/64/64 (slices) | Paired axial 2D | MAE: 46.01 ± 5.28 (51.01 ± 5.38). PSNR: 23.07 (22.66). SSIM: 0.8873 (0.8749) | |
| Yuan et al. 2020 | HN | U‐Net | MAE | | Intra‐patient RR; images cropped 256 × 256; central 52 slices used | 5‐CV (40/5/10) | Paired axial 2D | MAE: 49.24 (167.46). SSIM: 0.85 (0.42) | |
| Zhang et al. 2020 | Pelvis | GAN | Feature matching; MAE | Random left–right flipping; random small‐angle rotation; background noise | Intra‐patient DIR; HU standardized (mean 0, STD 1) | 150/0/15 | Paired multi‐slice axial 2.5D | MAE: 23.6 ± 4.5 (43.8 ± 6.9). PSNR: 20.09 ± 3.4 (14.53 ± 6.7) | DVH < 1% |
| Dahiya et al. 2021 | Thorax | GAN | Adversarial loss; MAE | Geometric augmentation (scale, shear, rotation) | CBCT artifact injection into CT; HU clipped [−1000, 3095]; HU rescaled [−1, 1]; intra‐patient DIR; image resampled to 128 × 128 × 128 | 140/0/15 | Paired 3D | MAE: 29.31 ± 12.64 (162.77 ± 53.91). RMSE: 78.62 ± 78.62 (328.18 ± 84.65). SSIM: 0.92 ± 0.01 (0.73 ± 0.07). PSNR: 34.69 ± 2.41 (22.24 ± 2.40) | |
| Dai et al. 2021 | Breast | Cycle‐GAN | Adversarial loss; cycle loss | | | 52/0/23 | | MAE: 71.58 ± 8.78 (86.42 ± 10.12) | GPR3: 91.46 ± 4.63%. GPR2: 85.09 ± 6.28%. DVH (CTV) < 3.58% |
| Dong et al. 2021 | Pelvis | Cycle‐GAN | Adversarial loss; cycle loss; identity loss | | Images resampled 1 × 1 × 1‐mm grid; HU rescaled [−1, 1]; voxels outside body set to −1000 HU | 46/0/9 | Unpaired axial 2D | MAE: 14.6 ± 2.39 (49.96 ± 7.21). RMSE: 56.05 ± 13.05 (105.9 ± 11.52). PSNR: 32.5 ± 1.87 (26.82 ± 0.638). SSIM: 0.825 ± 1.92 (0.728 ± 0.36) | |
| Gao et al. 2021 | Thorax | Cycle‐GAN | Adversarial loss; cycle loss; identity loss | | Intra‐patient RR; CT FOV cropped to CBCT; HU clipped [−1000, 1500]; HU rescaled [−1, 1]; images resampled 256 × 256 | 136/0/34 | Unpaired axial 2D | MAE: 43.5 ± 6.69 (92.8 ± 16.7). SSIM: 0.937 ± 0.039 (0.783 ± 0.063). PSNR: 29.5 ± 2.36 (21.6 ± 2.81) | GPR3: 99.7 ± 0.39% (92.8 ± 3.86%). GPR2: 98.6 ± 1.78% (84.4 ± 5.81%). GPR1: 91.4 ± 3.26% (50.1 ± 9.04%) |
| Liu et al. 2021 | Thorax | Modified ADN | Adversarial loss; attribute consistency loss; reconstruction loss; self‐reconstruction loss; SSIM loss | Random horizontal flip | Resample CT/CBCT to 1 × 1 × 1‐mm grid; resample to 384 × 384; extract 256 × 256 image patches; HU clipped [−1000, 2000]; HU rescaled [−1, 1]; intra‐patient RR | 32/8/12 | Unpaired axial 2D patch | MAE: 32.70 ± 7.26 (70.56 ± 11.81). RMSE: 60.53 ± 60.53 (112.13 ± 17.91). SSIM: 0.86 ± 0.04 (0.64 ± 0.04). PSNR: 34.12 ± 1.32 (28.67 ± 1.41) | |
Note: Dose similarity values in italics indicate proton plans; all others are photon plans. GPR3 = 3%/3 mm; GPR2 = 2%/2 mm.
Abbreviations: ADN, artifact disentanglement network; CBCT, cone‐beam CT; CNN, convolutional neural network; CV, cross‐validation; DIR, deformable image registration; DVH, dose–volume histogram; GAN, generative adversarial network; HN, head and neck; LoO‐CV, leave‐one‐out cross‐validation; MAE, mean absolute error; ME, mean error; P/C/GTV, planning/clinical/gross target volume; PSNR, peak signal‐to‐noise ratio; RMSE, root mean square error; RR, rigid registration; SSIM, structural similarity.
Image similarity metrics computed within body contour.
Summary of synthetic CT (sCT) generation methods
| Author and year | Anatomic site | Model | Loss function | Augmentation | Preprocessing | (train/val/test) | Training configuration | Image similarity (input CBCT) | Dose similarity |
|---|---|---|---|---|---|---|---|---|---|
| Qiu et al. 2021 | Thorax | Cycle‐GAN | Adversarial loss; cycle loss; histogram matching loss; synthetic loss; gradient loss; perceptual loss | Rotations; flips; rescaling; rigid deformations | Intra‐patient RR and DIR | 5‐CV (16/0/4) | Paired axial 2D | MAE: 66.2 ± 8.2 (110.0 ± 24.9) | |
| Rossi et al. 2021 | Pelvis | U‐Net | MAE | Random 90° rotations; horizontal flip | Voxels outside body set to −1000 HU; HU clipped [−1024, 3200]; HU rescaled [0, 1]; intra‐patient RR; image resampled to 256 × 256 | 4‐CV (42/0/14) | Paired axial 2D | MAE: 35.14 ± 13.19 (93.30 ± 59.60). PSNR: 30.89 ± 2.66 (26.70 ± 3.36). SSIM: 0.912 ± 0.033 (0.887 ± 0.048) | |
| Sun et al. 2021 | Pelvis | Cycle‐GAN | Adversarial loss; cycle loss; gradient loss | | Intra‐patient RR; image resampled to 384 × 192 × 192 | 5‐CV (80/20/20) | Paired patch 3D | MAE: 51.62 ± 4.49. SSIM: 0.86 ± 0.03. PSNR: 30.70 ± 0.78 (27.15 ± 0.57) | GPR2: 97% |
| Thummerer et al. 2021 | Thorax | U‐Net | MAE | | Intra‐patient RR and DIR; voxels outside body set to −1000 HU; CT FOV cropped to CBCT | 3‐CV (22/0/11) | Paired axial, sagittal, coronal 2D | MAE: 30.7 ± 4.4 | |
| Tien et al. 2021 | Breast | Cycle‐GAN | Adversarial loss; cycle loss; synthetic loss; identity loss; gradient loss | Random cropping to 128 × 128; random horizontal/vertical flips; random rotation | Clipped images to 264 × 336; HU clipped [−950, 500]; HU rescaled [0, 1] | 12/0/3 | Paired axial 2D | Average ROI HU. ROI MAE. ROI PSNR. ROI SSIM | |
| Uh et al. 2021 | Abdomen; Pelvis | Cycle‐GAN | Adversarial loss; cycle loss | | Intra‐patient RR; voxels outside body set to −1000 HU; body normalization: lateral extent of anatomy scaled to 475 mm; CBCT and CT resampled | 21/0/7; 29/0/7 | Paired axial 2D | Abdomen: MAE 44 (141); Pelvis: MAE 51 (105) | |
| Xue et al. 2021 | Nasopharynx | Cycle‐GAN | Adversarial loss; cycle loss; identity loss | | Intra‐patient RR; voxels outside body set to −1000 HU; HU clipped [−1000, 2000]; HU rescaled [−1, 1] | 135/0/34 | Paired axial 2D | MAE: 23.8 ± 8.6 (42.2 ± 17.4). RMSE: 79.7 ± 20.1 (134.3 ± 31.0). PSNR: 37.8 ± 2.1 (27.2 ± 1.9). SSIM: 0.96 ± 0.01 (0.91 ± 0.03) | GPR3 > 98.52% ± 3.09%. GPR2 > 96.82% ± 1.71% |
| Zhao et al. 2021 | Pelvis | Cycle‐GAN | Adversarial loss; cycle loss; idempotent loss; gradient loss | Added noise | Voxels outside body set to −1000 HU; intra‐patient RR; CBCT and CT resampled; HU clipped [−1000, 3095]; HU rescaled [−1, 1] | 100/0/10 | Unpaired axial 2D | MAE: 52.99 ± 12.09 (135.84 ± 41.59). SSIM: 0.81 ± 0.03 (0.44 ± 0.07). PSNR: 26.99 ± 1.48 (21.76 ± 1.95) | DVH < 50 cGy (< 350 cGy) |
| Wu et al. 2022 | Pelvis | Deep‐CNN | Gradient loss; MAE | | CBCT resampled to CT; intra‐patient DIR; voxels outside body set to −1000 HU; images cropped to 440 × 440; HU rescaled [0, 1] | 5‐CV (90/30/23) | Paired 2D | MAE: 52.18 ± 3.68 (352.56) | |
| Lemus et al. 2022 | Abdomen | Cycle‐GAN | Cycle loss; adversarial loss; gradient loss; idempotent loss; total variation loss | Random 256 × 256 image sampling | Intra‐patient RR (training); intra‐patient DIR (testing); images cropped to 480 × 384 | 10‐CV (11/0/6) | Paired 2D | MAE: 54.44 ± 16.39 (72.95 ± 6.63). RMSE: 108.765 ± 40.54 (137.29 ± 21.19) | DVH (PTV): 1.5% (3.6%). GPR3/2: 98.35% (96%) |
Note: Dose similarity values in italics indicate proton plans; all others are photon plans. GPR3 = 3%/3 mm; GPR2 = 2%/2 mm.
Abbreviations: CBCT, cone‐beam CT; CNN, convolutional neural network; CV, cross‐validation; DIR, deformable image registration; DVH, dose–volume histogram; GAN, generative adversarial network; HN, head and neck; LoO‐CV, leave‐one‐out cross‐validation; MAE, mean absolute error; ME, mean error; P/C/GTV, planning/clinical/gross target volume; PSNR, peak signal‐to‐noise ratio; RMSE, root mean square error; ROI, region of interest; RR, rigid registration; SSIM, structural similarity.
Image similarity metrics computed within body contour.
Miscellaneous approaches for cone‐beam CT (CBCT) correction
| Author and year | Anatomic site | Model | Loss function | Augmentation | Preprocessing | (train/val/test) | Training configuration | Image similarity (input CBCT) | Dose similarity |
|---|---|---|---|---|---|---|---|---|---|
| Hansen et al. 2018 | Pelvis | U‐Net | MSE | Linear combination of two random inputs (Mixup) | A priori scatter correction for target projections | 15/8/7 | Paired projection 2D | MAE: 46 (144). ME: −3 (138) | GPR2: 100%. GPR1: 90% |
| Jiang et al. 2019 | Pelvis | U‐Net | MSE | | MC scatter correction for target CBCTs | 15/3/2 | Paired axial 2D | RMSE: 18.8 (188.4). SSIM: 0.9993 (0.9753) | |
| Landry et al. 2019 | Pelvis | U‐Net 1 | MSE | Mixup | A priori scatter correction for target projections | 27/7/8 | Paired projection 2D | MAE: 51 (104). ME: 1 (30) | DPR2 > 99.5%. DPR1 > 98.4%. GPR2 > 99.5% |
| | | U‐Net 2 | | Random left‐right flips; random position shifts; random HU shifts | Intra‐patient DIR; CT resampled to CBCT; voxels outside body set to −1000 HU; CT cropped to CBCT cylindrical FOV; CT and CBCT cropped to remove conical ends of CBCT | | Paired axial 2D | MAE: 88 (104). ME: 2 (30) | DPR2 > 99.5%. DPR1 > 98.4%. GPR2 > 99.5% |
| | | U‐Net 3 | | | Voxels outside body set to −1000 HU | | | MAE: 58 (104). ME: 3 (30) | DPR2 > 99.5%. DPR1 > 98.4%. GPR2 > 99.5% |
| Nomura et al. 2019 | HN | U‐Net | MAE | Random left‐right flips; random 90° rotations | MC simulation of training, validation and testing data; voxels outside body set to −1000 HU; anatomy segmented into air, adipose, soft tissue, muscle, rib bone | Training: 5 phantoms. Validation: HN phantom. Testing: 1 HN, 1 thorax patient | Paired projections 2D | MAE: 17.9 ± 5.7 (21.8 ± 5.9). SSIM: 0.9997 ± 0.0003 (0.9995 ± 0.0003). PSNR: 37.2 ± 2.6 (35.6 ± 2.3) | |
| | Thorax | | | | | | | MAE: 29.0 ± 2.5 (32.5 ± 3.2). SSIM: 0.9993 ± 0.0002 (0.9990 ± 0.0003). PSNR: 31.7 ± 0.8 (30.6 ± 0.9) | |
| Lalonde et al. 2020 | HN | U‐Net | MAPE | Vertical and horizontal flips | Projections downsampled to 256 × 256; projection intensities normalized against flood field | 29/9/10 | Paired projections 2D | MAE: 13.41 (69.64). ME: −0.801 (−28.61) | |
| Rusanov et al. 2021 | HN | U‐Net | MAE | Random vertical/horizontal flips | Bowtie filter removal via projection normalization using flood field scan | 4/0/2 | Paired projection 2D | MAE: 74 (318) | |
Note: GPR3 = 3%/3 mm; GPR2 = 2%/2 mm; GPR1 = 1%/1 mm criteria. DPR2 = 2% DD threshold; DPR1 = 1% DD threshold.
Abbreviations: DIR, deformable image registration; HN, head and neck; MAE, mean absolute error; MC, Monte Carlo; ME, mean error; MSE, mean squared error; PSNR, peak signal‐to‐noise ratio; RMSE, root mean square error; SSIM, structural similarity.
Image similarity metrics computed within body contour.
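The dose difference pass rate (DPR) and gamma pass rate (GPR) reported in the tables can be illustrated with a simplified, globally normalized 1D implementation. Clinical gamma tools operate on 3D dose grids with interpolation and local/global normalization options, so the sketch below only demonstrates the definitions, not a validated analysis.

```python
import numpy as np

def dose_pass_rate(d_eval, d_ref, dd_percent=2.0):
    """DPR: fraction of voxels whose dose difference lies within
    dd_percent of the reference maximum (global normalization)."""
    tol = dd_percent / 100.0 * d_ref.max()
    return float(np.mean(np.abs(d_eval - d_ref) <= tol))

def gamma_pass_rate_1d(d_eval, d_ref, spacing=1.0, dd_percent=2.0, dta_mm=2.0):
    """Simplified global 1D gamma: for each reference point, search all
    evaluated points for the minimum combined dose/distance penalty;
    a point passes when that minimum (gamma) is <= 1."""
    dd_tol = dd_percent / 100.0 * d_ref.max()
    x = np.arange(len(d_ref)) * spacing
    gammas = []
    for i, dr in enumerate(d_ref):
        dose_term = ((d_eval - dr) / dd_tol) ** 2
        dist_term = ((x - x[i]) / dta_mm) ** 2
        gammas.append(np.sqrt(np.min(dose_term + dist_term)))
    return float(np.mean(np.array(gammas) <= 1.0))

ref = np.array([0.0, 50.0, 100.0, 50.0, 0.0])
ev = np.array([0.0, 51.0, 100.0, 49.0, 0.0])
assert dose_pass_rate(ev, ref, dd_percent=2.0) == 1.0   # all diffs <= 2% of max
assert gamma_pass_rate_1d(ev, ref) == 1.0
```

The distance term is what makes gamma more forgiving than DPR in steep dose gradients: a small spatial shift can fail the dose criterion everywhere yet still pass gamma.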
FIGURE 3. Distribution of total and per‐architecture investigations per year
FIGURE 4. Pie chart of the distribution of anatomic sites investigated
FIGURE 5. Percent mean absolute error (MAE) improvement per network for studies utilizing common data
FIGURE 6. Percent mean absolute error (MAE) improvement for cycle‐generative adversarial network (GAN) models trained with paired or unpaired datasets, controlling for pelvic, head and neck (HN), and all anatomical regions, as well as training set sizes within four patients
Mean cohort size and model performance statistics for all publications
| | Training size (no. patients) | Testing size (no. patients) | CBCT MAE (HU) | sCT MAE (HU) | % MAE improvement |
|---|---|---|---|---|---|
| All studies | 47.74 ± 47.11 | 11.02 ± 7.90 | – | 46.83 ± 22.23 | – |
| All studies* | 51.27 ± 44.27 | 11.77 ± 8.77 | 114.66 ± 75.84 | 45.13 ± 22.01 | 54.60 ± 17.90 |
| Pelvis* | 62.63 ± 43.20 | 10.88 ± 6.11 | 117.47 ± 93.73 | 41.58 ± 22.73 | 57.97 ± 19.68 |
| HN* | 45.63 ± 39.38 | 12.13 ± 10.17 | 106.59 ± 84.63 | 36.13 ± 21.76 | 58.67 ± 10.75 |
| Thorax* | 59.00 ± 56.18 | 14.17 ± 9.46 | 134.52 ± 49.45 | 54.12 ± 20.47 | 57.54 ± 12.65 |
| Abdomen* | 20.33 ± 7.36 | 4.67 ± 2.62 | 98.34 ± 30.35 | 51.78 ± 5.59 | 41.33 ± 19.51 |
Note: * indicates studies which reported CBCT and sCT MAE values.
Abbreviations: CBCT, cone‐beam CT; HN, head and neck; MAE, mean absolute error; sCT, synthetic CT.
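The "% MAE improvement" column above appears to be the sCT error reduction relative to the input CBCT error; that definition is inferred from the table layout rather than stated explicitly, so treat the helper below as an assumption:

```python
def mae_improvement(mae_cbct, mae_sct):
    """Percent MAE improvement of the synthetic CT over the input CBCT,
    relative to the CBCT error (assumed definition, inferred from the
    reported columns)."""
    return 100.0 * (mae_cbct - mae_sct) / mae_cbct

# e.g., a study improving MAE from 100 HU (CBCT) to 45 HU (sCT):
assert mae_improvement(100.0, 45.0) == 55.0
```

Note that the tabulated mean improvements are averages of per-study percentages, which is why they do not equal the ratio of the mean MAE columns.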
FIGURE 7. Scatter plot demonstrating the relationship between training cohort size and percent mean absolute error (MAE) improvement
FIGURE 8. Absolute synthetic CT (sCT) mean absolute error (MAE) ordered from highest to lowest, compared against training set size. Publication format describes: (model architecture/supervision type + additional loss functions and/or 3D training) | anatomical region. +, additional loss functions/3D input; A, abdomen; ADN, artifact disentanglement network; C, cycle‐GAN; CNN, convolutional neural network; D, deep CNN; G, GAN; GAN, generative adversarial network; HN, head and neck; P, paired training; P, pelvis; T, thorax; U, U‐Net; Un, unpaired training
FIGURE 9. Percentage mean absolute error (MAE) improvement ordered from lowest to highest, compared against training set size. Publication format describes: (model architecture/supervision type + additional loss functions and/or 3D training) | anatomical region. +, additional loss functions/3D input; A, abdomen; ADN, artifact disentanglement network; C, cycle‐GAN; CNN, convolutional neural network; D, deep CNN; G, GAN; GAN, generative adversarial network; HN, head and neck; P, paired training; P, pelvis; T, thorax; U, U‐Net; Un, unpaired training
FIGURE 10. Percentage structural similarity (SSIM) improvement ordered from lowest to highest, compared against training set size. Publication format describes: (model architecture/supervision type + additional loss functions and/or 3D training) | anatomical region. *, low dose CBCT; +, additional loss functions/3D input; A, abdomen; ADN, artifact disentanglement network; C, cycle‐GAN; CNN, convolutional neural network; D, deep CNN; G, GAN; GAN, generative adversarial network; HN, head and neck; P, paired training; P, pelvis; T, thorax; U, U‐Net; Un, unpaired training
Studies reporting mean gamma pass rates for different anatomical regions and radiation modalities
| | Head and neck | | | Pelvis | | | Abdomen | | | Breast | | | Lung | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | γ3 | γ2 | γ1 | γ3 | γ2 | γ1 | γ3 | γ2 | γ1 | γ3 | γ2 | γ1 | γ3 | γ2 | γ1 |
| Photon | [2, 0] | [5, 0] | [2, 0] | [1, 0] | [2, 0] | – | [1, 0] | – | – | [1, 1] | [0, 1] | – | [3, 0] | [1, 2] | – |
| Proton | [2, 0] | [2, 0] | – | [1, 0] | [2, 0] | – | – | [1, 0] | – | – | – | – | [1, 0] | [0, 1] | – |
Note: Mean gamma rates above 95% are considered clinically acceptable. Reporting format: [N > 95%, N < 95%], with N the number of evaluations. γ3 = 3%/3 mm; γ2 = 2%/2 mm; γ1 = 1%/1 mm; γ3/2 = 3%/2 mm.